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Abstract 



The canonical problem of solving a system of linear equations arises in numerous contexts in 
information theory, communication theory, and related fields. In this contribution, we develop 
a solution based upon Gaussian belief propagation (GaBP) that does not involve direct matrix 
inversion. The iterative nature of our approach allows for a distributed message-passing imple- 
mentation of the solution algorithm. In the first part of this thesis, we address the properties of 
the GaBP solver. We characterize the rate of convergence, enhance its message-passing efficiency 
by introducing a broadcast version, discuss its relation to classical solution methods including nu- 
merical examples. We present a new method for forcing the GaBP algorithm to converge to the 
correct solution for arbitrary column dependent matrices. 

In the second part we give five applications to illustrate the applicability of the GaBP algorithm 
to very large computer networks: Peer-to-Peer rating, linear detection, distributed computation 
of support vector regression, efficient computation of Kalman filter and distributed linear pro- 
gramming. Using extensive simulations on up to 1,024 CPUs in parallel using IBM Bluegene 
supercomputer we demonstrate the attractiveness and applicability of the GaBP algorithm, using 
real network topologies with up to millions of nodes and hundreds of millions of communication 
links. We further relate to several other algorithms and explore their connection to the GaBP 
algorithm. 
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Chapter 1 
Introduction 



Solving a linear system of equations Ax = b is one of the most fundamental problems in algebra, 
with countless applications in the mathematical sciences and engineering. In this thesis, we 
propose an efficient distributed iterative algorithm for solving systems of linear equations. 

The problem of solving a linear system of equations is described as follows. Given an ob- 
servation vector b G and the data matrix A G M'"^" [m > n E Z), a unique solution, 
X = X* G M", exists if and only if the data matrix A has full column rank. Assuming a nonsin- 
gular matrix A, the system of equations can be solved either directly or in an iterative manner. 
Direct matrix inversion methods, such as Gaussian elimination (LU factorization, [1]-Ch. 3) or 
band Cholesky factorization ( [1]-Ch. 4), find the solution with a finite number of operations, 
typically, for a dense n x n matrix, of the order of n^. The former is particularly effective for 
systems with unstructured dense data matrices, while the latter is typically used for structured 
dense systems. 

Iterative methods [2] are inherently simpler, requiring only additions and multiplications, and 
have the further advantage that they can exploit the sparsity of the matrix A to reduce the 
computational complexity as well as the algorithmic storage requirements [3]. By comparison, for 
large, sparse and amorphous data matrices, the direct methods are impractical due to the need 
for excessive matrix reordering operations. 

The main drawback of the iterative approaches is that, under certain conditions, they converge 
only asymptotically to the exact solution x* [2]. Thus, there is the risk that they may converge 
slowly, or not at all. In practice, however, it has been found that they often converge to the exact 
solution or a good approximation after a relatively small number of iterations. 

A powerful and efficient iterative algorithm, belief propagation (BP, [4]), also known as the 
sum-product algorithm, has been very successfully used to solve, either exactly or approximately, 
inference problems in probabilistic graphical models [5]. 

In this thesis, we reformulate the general problem of solving a linear system of algebraic 
equations as a probabilistic inference problem on a suitably-defined graph-^. Furthermore, for 

■"^Recently, we have found out the work of Moallemi and Van Roy [6] which discusses the connection between 
the Min-Sum message passing algorithm and solving quadratic programs. Both works [7, 6] were published in 
parallel, and the algorithms where derived independently, using different techniques. In Section 11.4 we discuss 



1 



CHAPTER 1. INTRODUCTION 



the first time, a full step-by-step derivation of the GaBP algorithm from the belief propagation 
algorithm is provided. 

As an important consequence, we demonstrate that Gaussian BP (GaBP) provides an effi- 
cient, distributed approach to solving a linear system that circumvents the potentially complex 
operation of direct matrix inversion. Using the seminal work of Weiss and Freeman [8] and some 
recent related developments [9,10,6], we address the convergence and exactness properties of 
the proposed GaBP solver. 

This thesis is structured as follows Chapter 2 introduces the GaBP by providing a clean step- 
by-step derivation of the GaBP algorithm by substituting gaussian probability into Pearl's belief 
propagation update rules. 

Starting from Chapter 3, we present our novel contributions in this domain. Chapter 3 presents 
our novel broadcast version of GaBP that reduces the number of unique messages on a dense graph 
from O(ri^) to 0{n). This version allows for efficient implementation on communication networks 
that supports broadcast such as wireless and Ethernet. (Example of an efficient implementation 
of the broadcast version on top of 1,024 CPUs is reported in Chapter 8). We investigate the use 
of acceleration methods from linear algebra to be applied for GaBP. We compare methodically 
GaBP algorithm to other linear iterative methods. Chapter 3 further provides theoretical analysis 
of GaBP convergence rate assuming diagonally dominant inverse covariance matrix. This is the 
first time convergence rate of GaBP is characterized. 

In Chapter 4 we give numerical examples for illustrating the convergence properties of the 
GaBP algorithm. 

The GaBP algorithm, like the linear iterative methods, has sufficient conditions for conver- 
gence. When those sufficient conditions do not hold, the algorithm may diverge. To address 
this problem. Chapter 5 presents our novel construction for forcing convergence of the GaBP 
algorithm to the correct solution, for positive definite matrices, as well as for column dependent 
non-square matrices. We believe that this result is one of the main novelties of this work, since 
it applies not only to the GaBP algorithm but to other linear iterative methods as well. 

The second part of this work is mainly concentrated with applications for the GaBP algorithm. 
The first application we investigate (Chapter 6) is the rating of nodes in a Peer-to-Peer network. 
We propose a unifying family of quadratic cost functions to be used in Peer-to-Peer ratings. We 
show that our approach is general, since it captures many of the existing algorithms in the fields of 
visual layout, collaborative filtering and Peer-to-Peer rating, among them Koren's spectral layout 
algorithm, Katz's method. Spatial ranking. Personalized PageRank and Information Centrality. 
Beside of the theoretical interest in finding common basis of algorithms that were not linked 
before, we allow a single efficient implementation for computing those various rating methods, 
using the GaBP solver. We provide simulation results on top of real life topologies including the 
MSN Messenger social network. 

In Chapter 7 we consider the problem of linear detection using a decorrelator in a code- 
division multiple-access (CDMA) system. Through the use of the iterative message-passing 
formulation, we implement the decorrelator detector in a distributed manner. This example 
allows us to quantitatively compare the new GaBP solver with the classical iterative solution 

the connection between the two algorithms, and show they are equivalent. 
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1.1. MATERIAL COVERED IN THIS THESIS 



methods that have been previously investigated in the context of a linear implementation of CDMA 
demodulation [11,12,13]. We show that the GaBP-based decorrelator yields faster convergence 
than these conventional methods. We further extend the applicability of the GaBP solver to 
non-square column dependent matrices. 

Third application from the field of machine learning is support vector regression, described 
in Chapter 8. We show how to compute kernel ridge regression using our GaBP solver. For 
demonstrating the applicability of our approach we used a cluster of IBM BlueGene computers 
with up to 1,024 CPUs in parallel on very large data sets consisting of millions of data points. 
Up to date, this is the largest implementation of belief propagation ever performed. 

Fourth application is the efficient distributed calculation of Kalman filter, presented in Chapter 
9. We further provide some theoretical results that link the Kalman filter algorithm to the Gaussian 
information bottleneck algorithm and the Affine-scaling interior point method. 

Fifth application is the efficient distributed solution of linear programming problems using 
GaBP, presented in Chapter 10. As a case study, we discuss the network utility maximization 
problem and show that our construction has a better accuracy than previous methods, despite the 
fact it is distributed. We provide a large scale simulation using networks of hundreds of thousands 
of nodes. 

Chapter 11 identifies a family of previously proposed algorithms, showing they are instances of 
GaBP. This provides the ability to apply the vast knowledge about GaBP to those special cases, 
for example applying sufficient conditions for convergence as well as applying the convergence fix 
presented in Chapter 5. 

1.1 Material Covered in this Thesis 

This thesis is divided into two parts. The first part discusses the theory of Gaussian belief 
propagation algorithm and covers the following papers: [7,14,15,16,17,18]. The second part 
discusses several applications that were covered in the following papers: [19,20,21,22,23,24,25]. 

Below we briefly outline some other related papers that did not fit into the main theme of 
this thesis. We have looked at belief propagation at the using discrete distribution as a basis for 
various distributed algorithms: clustering [26], data placement [27,28] Peer-to-Peer streaming 
media [29] and wireless settings [30]. The other major topic we worked on is Peer-to-Peer 
networks, especially content distribution networks (CDNs). The Julia content distribution network 
is described on [31,32]. A modified version of Julia using network coding [33]. Tulip is a Peer- 
to-Peer overlay that enables fast routing and searching [34]. The eMule protocol specification is 
found on [35]. An implementation of a distributed testbed on eight European clusters is found 
on [36]. 

1.2 Preliminaries: Notations and Definitions 

We shall use the following linear-algebraic notations and definitions. The operator {-j^ stands 
for a vector or matrix transpose, the matrix I„ is an n x n identity matrix, while the symbols {■}i 
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and {■}ij denote entries of a vector and matrix, respectively. Let M G R'^^'^ be a real symmetric 
square matrix and A G M'"^" be a real (possibly rectangular) matrix. Let 1 denotes the all ones 
vector. 

Definition 1 (Pseudoinverse). The Moore-Penrose pseudoinverse matrix of the matrix A, de- 
noted by , is defined as 

At 4 (A^A)"'a^. (1.1) 

Definition 2 (Spectral radius). The spectral radius of the matrix'Wl, denoted by p(]sJV), is defined 
to be the maximum of the absolute values of the eigenvalues of M, i.e. , 

p(M)4max(|A,|), (1.2) 

\<1<S 

where Ai, . . . are the eigenvalues of the matrix M. 
Definition 3 (Diagonal dominance). The matrix M. is 



1. weakly diagonally dominant if 

|Mii| > J]|M,,|,Vz, (1.3) 

2. strictly diagonally dominant if 

|M,,| >^|M,,|,V2, (1.4) 

3. irreducibly diagonally dominant //M is irreducible^ , and 

\Mu\>J2\Mij\,\^t, (1.5) 

with strict inequality for at least one i. 

Definition 4 (PSD). The matrix M is positive semi-definite (PSD) if and only if for all non-zero 
real vectors z G W\ 

z^Mz > 0. (1.6) 

Definition 5 (Residual). For a real vector G M", the residual, r = r(x) G M™, of a linear 
system is r = Ax — b. 

The standard norm of the residual, | |r| |p(p = 1,2, ... , oo), is a good measure of the accuracy 
of a vector x as a solution to the linear system. In our experimental study, the Frobenius norm 
{i.e. ,p = 2) per equation is used, ^||r||F = ^v^El^T^- 

^ A matrix is said to be reducible if there is a permutation matrix P such that PMP-^ is block upper triangular. 
Otherwise, it is irreducible. 
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1.3. PROBLEM FORMULATION 



Definition 6 (Operator Norm). Given a matrix M. the operator norm ||A/||p is defined as 

iMxIL 



||M||p^sup. 

XT^O ll^llp 

Definition 7. The condition number, k, of the matrix M is defined as 

Kp^||M||p||M-i||p. (1.7) 
For M being a normal matrix (i.e. , M^M = MM^j, the condition number is given by 

Xr 



«2 



X-mi.. 

where Xmax 3nd Xmin 3re the maximal and minimal eigenvalues ofWl, respectively. 



(1.8) 



Even though a system is nonsingular it could be ill-conditioned. Ill-conditioning means that a 
small perturbation in the data matrix A, or the observation vector b, causes large perturbations 
in the solution, x*. This determines the difficulty of solving the problem. The condition number 
is a good measure of the ill-conditioning of the matrix. The better the conditioning of a matrix 
the smaller the condition number. The condition number of a non-invertible (singular) matrix is 
set arbitrarily to infinity. 

Definition 8 (Graph Laplacian^ [37]). Given a matrix a weighted matrix A describing a graph 
with n nodes, the graph Laplacian L is a symmetric matrix defined as follows: 




deg{i) 



where deg{i) = YjjeN{i) ^i* degree of node i. 



Xj 



It can be further shown [37] that given the Laplacian L, it holds that x Lx = Xli<j "^iji^i 

\2 



1.3 Problem Formulation 

Let A G M™^" [m,n E W) be a full column rank, mxn real-valued matrix, with m > n, and 
let b G M."^ be a real-valued vector. Our objective is to efficiently find a solution x* to the linear 
system of equations Ax = b given by 

X* = A^b. (1.9) 

Assumption 9. The matrix A is square (i.e. , m = n) and symmetric. 

■^Note this definition is an extension of the Laplacian in the unweighed case, where all edges have weight 1. 
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For the case of square invertible matrices the pseudoinverse matrix is nothing but the data 
matrix inverse, i.e. , = A^^. For any linear system of equations with a unique solution, As- 
sumption 9 conceptually entails no loss of generality, as can be seen by considering the invertible 
system defined by the new symmetric (and PSD) matrix A'^nxm-^mxn ^ -^nxn and vector 
A'^nxmbmxi ^ ^nxi- However, this transformation involves an excessive computational com- 
plexity of 0{n^m) and 0{nm) operations, respectively. Furthermore, a sparse data matrix may 
become dense due to the transformation, an undesired property as far as complexity is concerned. 
Thus, we first limit the discussion to the solution of the popular case of square matrices. In 
Section 7.1.1 the proposed GaBP solver is extended to the more general case of linear systems 
with rectangular m x n full rank matrices. 
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Chapter 2 

The GaBP-Based Solver Algorithm 



In this section, we show how to derive the iterative, Gaussian BP-based algorithm that we propose 
for solving the linear system 



2.1 From Linear Algebra to Probabilistic Inference 

We begin our derivation by defining an undirected graphical model {i.e. , a Markov random 
field), Q, corresponding to the linear system of equations. Specifically, let Q = {X,£), where 
X \s a set of nodes that are in one-to-one correspondence with the linear system's variables 
X = {xi, . . . ,Xn}^, and where £^ is a set of undirected edges determined by the non-zero entries 
of the (symmetric) matrix A. 

We will make use of the following terminology and notation in the discussion of the GaBP 
algorithm. Given the data matrix A and the observation vector b, one can write explicitly 
the Gaussian density function p(x) ~ exp(— |x-^Ax + b^x), and its corresponding graph Q 
consisting of edge potentials ('compatibility functions') ipij and self potentials ('evidence') 0j. 
These graph potentials are determined according to the following pairwise factorization of the 
Gaussian function (2.2) 

n 

p(x) oc ]^</)i(xi) J]^ , (2.1) 

resulting in ipij{xi,Xj) = exp{—^XiAijXj) and = exp ( — |Ajjxf + fejXj). The edges set 

{i,j} includes all non-zero entries of A for which i > j. The set of graph nodes N(z) denotes 
the set of all the nodes neighboring the ith node (excluding node i). The set N(2)\j excludes 
the node j from N(i). 

Using this graph, we can translate the problem of solving the linear system from the algebraic 
domain to the domain of probabilistic inference, as stated in the following theorem. 
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2.1. LINEAR ALGEBRA TO INFERENCE 



Ax = b 

t 

Ax - b = 

t 

min(ix"^Ax — b"^x) 



max(— ix"^Ax + b"^x) 



maxexp(— ix-^Ax + b"^x) 



Figure 2.1: Schematic outline of the of the proof to Proposition 10. 



Proposition 10. The computation of the solution vectors* = A^^b is identical to the inference 
of the vector of marginal means ji = {jii, . . . ,/in} over the graph Q with the associated joint 
Gaussian probability density function p(x) ~ A/'(yU, A~^). 

Proof. Another way of solving the set of linear equations Ax — b = is to represent it by 
using a quadratic form g(x) = x^Ax/2 — b-^x. As the matrix A is symmetric, the derivative of 
the quadratic form w.r.t. the vector x is given by the vector dq/dx = Ax — b. Thus equating 
dq/dx = gives the global minimum x* of this convex function, which is nothing but the desired 
solution to Ax = b. 

Next, one can define the following joint Gaussian probability density function 

p(x) = Z-^ exp ( - g(x)) = Z'^ exp (-x^Ax/2 + b^x), (2.2) 

where Z is a distribution normalization factor. Denoting the vector fi = A^^b, the Gaussian 
density function can be rewritten as 

p(x) = Z"^ exp (/i^A/i/2) 

X exp (-x^Ax/2 + yU^Ax - /i^A/i/2) 

= r'exp(-(x-/i)^A(x-/i)/2) 

= Ar(/i,A-i), (2.3) 

where the new normalization factor ( = Zexp (— /i^A/i/2). It follows that the target solu- 
tion X* = A^^b is equal to i^i = A^^b, the mean vector of the distribution p{x), as defined 
above (2.2). 

Hence, in order to solve the system of linear equations we need to infer the marginal densities, 
which must also be Gaussian, p{xi) J\f{iXi = {A^^bjj, i^"^ = {A~^}ji), where fii and Pi are 
the marginal mean and inverse variance (sometimes called the precision), respectively. □ 
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According to Proposition 10, solving a deterministic vector- matrix linear equation translates 
to solving an inference problem in the corresponding graph. The move to the probabilistic domain 
calls for the utilization of BP as an efficient inference engine. 

Remark 11. Defining a jointly Gaussian probability density function, immediately yields an im- 
plicit assumption on the positive semi-definiteness of the precision matrix A, in addition to the 
symmetry assumption. However, we would like to stress out that this assumption emerges only 
for exposition purposes, so we can use the notion of 'Gaussian probability', but the derivation 
of the GaBP solver itself does not use this assumption. See the numerical example of the exact 
GaBP-based solution of a system with a symmetric, but not positive semi-definite, data matrix 
A in Section 4.2. 

2.2 Belief Propagation 

Belief propagation (BP) is equivalent to applying Pearl's local message-passing algorithm [4], 
originally derived for exact inference in trees, to a general graph even if it contains cycles (loops). 
BP has been found to have outstanding empirical success in many applications, e.g. , in decoding 
Turbo codes and low-density parity-check (LDPC) codes. The excellent performance of BP in 
these applications may be attributed to the sparsity of the graphs, which ensures that cycles in 
the graph are long, and inference may be performed as if it were a tree. 

The BP algorithm functions by passing real-valued messages across edges in the graph and 
consists of two computational rules, namely the 'sum-product rule' and the 'product rule'. In 
contrast to typical applications of BP in coding theory [38], our graphical representation resembles 
to a pairwise Markov random field [5] with a single type of propagating messages, rather than 
a factor graph [39] with two different types of messages, originated from either the variable 
node or the factor node. Furthermore, in most graphical model representations used in the 
information theory literature the graph nodes are assigned with discrete values, while in this 
contribution we deal with nodes corresponding to continuous variables. Thus, for a graph Q 
composed of potentials tpij and (f)i as previously defined, the conventional sum-product rule 
becomes an integral-product rule [8] and the message mij{xj), sent from node i to node j over 
their shared edge on the graph, is given by 

mij{xj) (X j il)ij{xi,Xj)(t)i{xi) J]^ mki{xi)dxi. (2.4) 

fc6N(i)\j 

The marginals are computed (as usual) according to the product rule 

p{xi) = a(j)i{xi) Y\ ^ki{xi), (2.5) 

fcGN(j) 

where the scalar « is a normalization constant. Note that the propagating messages (and the 
graph potentials) do not have to describe valid [i.e. , normalized) density probability functions, 
as long as the inferred marginals do. 
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2.3 The Gaussian BP Algorithm 

Gaussian BP is a special case of continuous BP, where the underlying distribution is Gaussian. 
The GaBP algorithm was originally introduced by Weiss et al. [8]. Weiss work do not detail the 
derivation of the GaBP algorithm. We believe that this derivation is important for the complete 
understanding of the GaBP algorithm. To this end, we derive the Gaussian BP update rules by 
substituting Gaussian distributions into the continuous BP update equations. 

Given the data matrix A and the observation vector b, one can write explicitly the Gaussian 
density function, p(x) ~ exp(— ix^Ax + b^x), and its corresponding graph Q. Using the graph 
definition and a certain (arbitrary) pairwise factorization of the Gaussian function (2.3), the edge 
potentials ('compatibility functions') and self potentials ('evidence') (pi are determined to be 

ipij{xi,Xj) = exp^-^XiAijXj), (2.6) 
(j)i{xi) = exp {- lAiix'^i +biXi), (2.7) 

respectively. Note that by completing the square, one can observe that 

oc ^^{^^u = h/Au, Pu' = A^')- (2.8) 

The graph topology is specified by the structure of the matrix A, i.e. the edges set includes 
all non-zero entries of A for which i > j. 

Before describing the inference algorithm performed over the graphical model, we make the 
elementary but very useful observation that the product of Gaussian densities over a common 
variable is, up to a constant factor, also a Gaussian density. 

Lemma 12. Let fi{x) and f2{x) be the probability density functions of a Gaussian random vari- 
able with two possible densities M {ni, Pi^) and M{fX2, P2^), respectively. Then their product, 
f{x) = /i(x)/2(x) is, up to a constant factor, the probability density function of a Gaussian 
random variable with distribution J\f{fj., P~^) , where 



Proof. Taking the product of the two Gaussian probability density functions 



/i(x)/2(x) = exp ( - {P,ix - + P^ix - /X2)^)/2) 

and completing the square, one gets 

= ^ exp ( - P(x - /i)V2), 



with 



/i = P~\Pifli + P2fl2), 
P-' = {Pi + P2)-\ 



P = P1+P2, 

^ p-l(^lPl+/i2P2) 
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and the scalar constant determined by 



— exp U{P,fil{p-'P, - 1) + P2fil{P-'P2 - 1) + 2P-iPiP2/ii/i2)). (2.15) 

l-T 2 ^ 

Hence, the product of the two Gaussian densities is C ■ J\f{fi, P~^). 



□ 




Figure 2.2: Belief propagation message flow 



Fig. 2.2 plots a portion of a certain graph, describing the neighborhood of node i. Each 
node (empty circle) is associated with a variable and self potential 0, which is a function of this 
variable, while edges go with the pairwise (symmetric) potentials Messages are propagating 
along the edges on both directions (only the messages relevant for the computation of my are 
drawn in Fig. 2.2). Looking at the right hand side of the integral-product rule (2.4), node i needs 
to first calculate the product of all incoming messages, except for the message coming from node 
j. Recall that since p(x) is jointly Gaussian, the factorized self potentials oc M{^ii,P^i^) 

(2.8) and similarly all messages rrikiixi) oc J\f{fXki, Pki^) si'e of Gaussian form as well. 

As the terms in the product of the incoming messages and the self potential in the integral- 
product rule (2.4) are all a function of the same variable, Xi (associated with the node i), then, 
according to the multivariate extension of Lemma 12, 



M{^li\j,P^s^]) (X (l)i{Xi) \Y ^ki{Xi) 

fceN(i)\i 



(2.16) 



Applying the multivariate version of the product precision expression in (2.10), the update rule 
for the inverse variance is given by (over-braces denote the origin of each of the terms) 



(t>i(xi) m^iixi) 
Pa ~l~ ^ ^ Pki ) 



(2.17) 



fc6N(i)\j 
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where Pa = An is the inverse variance a-priori associated with node i, via the precision of 
(pi{xi), and Pki are the inverse variances of the messages mki{xi). Similarly using (2.9) for the 
multivariate case, we can calculate the mean 

(f)i{xi) rrn:i{xi) 

= P^y (^Piifiii + ^ PkifJ'ki^, (2.18) 

feGN(i)\i 

where fiu = hrj An is the mean of the self potential and are the means of the incoming 
messages. 

Next, we calculate the remaining terms of the message niij^Xj), including the integration 
over Xi. 

mij{xj) oc / 'ilJij{xi,Xj)(f)i{xi) Y\_ 'rnki{xi)dxi (2-19) 

A:GN(i)\j 

f ' 2 ' 

OC / exp {-XiAi^Xj) exp [-Pi\^{xj2 - fii\jXi)) dxi (2.20) 

Jxi 

= / exp {{-Pi\jx'^j2) + {Pi\jfJ^i\j - AijXj)xi)dxi (2.21) 

Jxi 

oc exp {{Pi\jHi\j - AijXjfl i2Pi\j)) (2.22) 
oc Afifiij = -P^j'Ajfi^j, Pr^' = -A7^P^), (2.23) 

where the exponent (2.22) is obtained by using the Gaussian integral (2.24): 

/oo 
exp {—ax^ + hx)dx = \/ vr/aexp (6^/4a), (2.24) 
-oo 

we find that the messages mij(xj) are proportional to normal distribution with precision and 
mean 

i^, = -4i>^, (2.25) 
Hij = -Pr.^Aij^i\j . (2.26) 

These two scalars represent the messages propagated in the Gaussian BP-based algorithm. 

Finally, computing the product rule (2.5) is similar to the calculation of the previous prod- 
uct (2.16) and the resulting mean (2.18) and precision (2.17), but including all incoming mes- 
sages. The marginals are inferred by normalizing the result of this product. Thus, the marginals 
are found to be Gaussian probability density functions J\f{fii, Pf^) with precision and mean 

ct>i{xi) m.kiixi) 
fc6N(i) 

^l^ = p;\][p^i+ Ka^), (2.28) 

fc6N(i) 
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respectively. The derivation of the GaBP-based solver algorithm is concluded by simply substi- 
tuting the explicit derived expressions of Pi\j (2.17) into Pij (2.25), (2.18) and Pij (2.25) 
into /iy (2.26) and Pi\j (2.17) into fa (2.28). 

The message passing in the GaBP solver can be performed subject to any scheduling. We 
refer to two conventional messages updating rules: parallel (flooding or synchronous) and serial 
(sequential, asynchronous) scheduling. In the parallel scheme, messages are stored in two data 
structures: messages from the previous iteration round, and messages from the current round. 
Thus, incoming messages do not affect the result of the computation in the current round, since 
it is done on messages that were received in the previous iteration round. Unlike this, in the serial 
scheme, there is only one data structure, so incoming messages in this round, change the result of 
the computation. In a sense it is exactly like the difference between the Jacobi and Gauss-Seidel 
algorithms, to be discussed in the following. Some in-depth discussions about parallel vs. serial 
scheduling in the BP context (the discrete case) can be found in the work of Elidan et al. [40]. 

Algorithm 1. 



1. 


Initialize: 


/ 


Set the neighborhood N(i) to include 










VA; ^ zBAki + 0. 








/ 


Set the scalar fixes 










Pa = An and = k/An, \fi. 








/ 


Set the initial N(2) 3 k i scalar messaj 


les 








Pki = and fiki = 0. 








/ 


Set a convergence threshold e. 




2. 


Iterate: 


/ 


Propagate the N(?) 3 k ^ i messages 










P^i and fiki, (under certain scheduling). 








/ 


Compute the N(j) 3 i ^ j scalar messages 










Pij = —Afj/ (Pjj + J2k€N(i)\j Pki) , 










l^ij ~ {Piif'ii + ^keN(i)\j PkifJ'ki) / Aij . 




3. 


Check: 


/ 


If the messages Pij and fiij did not 










converge (w.r.t. e) , return to Step 2. 








/ 


Else, continue to Step 4. 




4. 


Infer : 


/ 


Compute the marginal means 










A'-i = {PiiHii + Ylik&i(i) Pkij^ki) / {Pii + ^fcgN(i) ^k 










Optionally compute the marginal precisions 








Pi — Pa + XlfegN(i) ^ki ) 




5. 


Solve: 


/ 


Find the solution 










X* = Hi, 
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2.4 Max-Product Rule 

A well-known alternative version to the sum-product BP is the max-product (a.k.a. min-sum) 
algorithm [41]. In this variant of BP, maximization operation is performed rather than marginal- 
ization, i.e. , variables are eliminated by taking maxima instead of sums. For Trellis trees {e.g. , 
graphically representing convolutional codes or ISI channels), the conventional sum-product BP 
algorithm boils down to performing the BCJR algorithm [42], resulting in the most probable sym- 
bol, while its max-product counterpart is equivalent to the Viterbi algorithm [43], thus inferring 
the most probable sequence of symbols [39]. 

In order to derive the max-product version of the proposed GaBP solver, the integral (sum)- 
product rule (2.4) is replaced by a new rule 

mij{xj) (X aTgmaxipij{xi,Xj)(j)i{xi) Y\_ mki{xi)dxi. (2.29) 

feeN(»)\j 

Computing mij{xj) according to this max-product rule, one gets 

mij{xj) oc aTgmaxipij{xi,Xj)(j)i{xi) Y\_ ^ki{xi) (2.30) 

fc6N(i)\j 

OC argmaxexp [—XiAijXj) exp {—Pi\j{x,- /2 — iii\jXi)) (2-31) 
= argmaxexp {{-Pi\jx'^J 2) + {Pi\jfii\j - AijXj)xi). (2.32) 

Xi 

Hence, x™^^, the value of maximizing the product nfceN(i)\j given 

by equating its derivative w.r.t. Xi to zero, yielding 

^max _ - ^^3^3 (2.33) 



Substituting x™^"" back into the product, we get 



mij{xj) oc exp{{Pi\jHi\j - AijXjY/{2Pi\j)) (2.34) 
oc Ar(^,, = -P,r^'A,fi,\,, P7^ = -Ar.'PA.), (2.35) 

which is identical to the result obtained when eliminating Xj via integration (2.23). 

mi,{x,) oc = -Pr'A,fi,\,, Pr' = -A-.'Paj), (2.36) 

which is identical to the messages derived for the sum-product case (2.25)-(2.26). Thus interest- 
ingly, as opposed to ordinary (discrete) BP, the following property of the GaBP solver emerges. 

Corollary 13. The max-product (2.29) and sum-product (2.4) versions of ttie GaBP solver are 
identical. 
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2.5 Convergence and Exactness 

In ordinary BP, convergence does not entail exactness of the inferred probabilities, unless the 
graph has no cycles. Luckily, this is not the case for the GaBP solver. Its underlying Gaussian 
nature yields a direct connection between convergence and exact inference. Moreover, in contrast 
to BP the convergence of GaBP is not limited for tree or sparse graphs and can occur even for 
dense (fully-connected) graphs, adhering to certain rules discussed in the following. 

We can use results from the literature on probabilistic inference in graphical models [8,9,10] 
to determine the convergence and exactness properties of the GaBP-based solver. The following 
two theorems establish sufficient conditions under which GaBP is guaranteed to converge to the 
exact marginal means. 

Theorem 14. [8, Claim 4] If the matrix A is strictly diagonally dominant, then GaBP converges 
and the marginal means converge to the true means. 

This sufficient condition was recently relaxed to include a wider group of matrices. 

Theorem 15. [9, Proposition 2] If the spectral radius of the matrix A satisfies 

p(|I„-A|)<l, (2.37) 

then GaBP converges and the marginal means converge to the true means. (The assumption here 
is that the matrix A is first normalized by multiplying with D^^/^AD"^/^, where D = diag(A).j 

A third and weaker sufficient convergence condition (relative to Theorem 15) which charac- 
terizes the convergence of the variances is given in [6, Theorem 2]: For each row in the matrix A, 
if Afj^ > 'Ej^iAfj then the variances converge. Regarding the means, additional condition related 
to Theorem 15 is given. 

There are many examples of linear systems that violate these conditions, for which GaBP 
converges to the exact means. In particular, if the graph corresponding to the system is acyclic 
{i.e. , a tree), GaBP yields the exact marginal means (and variances [8]), regardless of the value 
of the spectral radius of the matrix [8]. In contrast to conventional iterative methods derived 
from linear algebra, understanding the conditions for exact convergence remain intriguing open 
problems. 
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Chapter 3 

GaBP Algorithm Properties 



Starting this chapter, we present our novel contributions in this GaBP domain. We provide a 
theoretical analysis of GaBP convergence rate assuming diagonally dominant inverse covariance 
matrix. This is the first time convergence rate of GaBP is characterized. Chapter 3 further 
presents our novel broadcast version of GaBP, which reduces the number of unique messages 
on a dense graph from O(n^) to 0{n). This version allows for efficient implementation on 
communication networks that supports broadcast such as wireless and Ethernet. (Example of an 
efficient implementation of the broadcast version on top of 1,024 CPUs is reported in Chapter 
8). We investigate the use of acceleration methods from linear algebra to be applied for GaBP 
and compare methodically GaBP algorithm to other linear iterative methods. 

3.1 Upper Bound on Convergence Rate 

in this section we give an upper bound on convergence rate of the GaBP algorithm. As far as we 
know this is the first theoretical result bounding the convergence speed of the GaBP algorithm. 

Our upper bound is based on the work of Weiss et al. [44, Claim 4], which proves the 
correctness of the mean computation. Weiss uses the pairwise potentials form^, where 



We further assume scalar variables. Denote the entries of the inverse pairwise covariance matrix 
V,, and the inverse covariance matrix V„ as: 



p(x) oc Uijtpij{xi,Xj)Ui'4ji{xi) , 





Weiss assumes variables with zero means. The mean value does not affect convergence speed. 
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Assuming the optimal solution is x*, for a desired accuracy e||b||oo where ||b||oo = maxj |bi|, 
and b is the shift vector, [44, Claim 4] proves that the GaBP algorithm converges to an accuracy 
of \x* — Xt\ < e||b||oo after at most t = [log(e)/log(/3)] rounds, where P = maxj^ \bij/cij\. 

The problem with applying Weiss' result directly to our model is that we are working with 
different parameterizations. We use the information form p(x) oc exp(— |x^Ax + b^x). The 
decomposition of the matrix A into pairwise potentials is not unique. In order to use Weiss' 
result, we propose such a decomposition. Any decomposition from the information form to the 
pairwise potentials form should be subject to the following constraints [44] 

Pairwise form ■ r .■ r 

Information form 

bij ciij , 

which means that the inverse covariance in the pairwise model should be equivalent to inverse 
covariance in the information form. 

Pairwise form , . . . 
, ^ , information form 



Cij di 



The second constraints says that the sum of node i's inverse variance (in both the self potentials 
and edge potentials) should be identical to the inverse variance in the information form. 

We propose to initialize the pairwise potentials as following. Assuming the matrix A is 
diagonally dominant, we define e-i to be the non negative gap 



and the following decomposition 

bij = ttij, dij = dj + ei/\N{i) \ , 
where |A^(i)| is the number of graph neighbors of node i. Following Weiss, we define 7 to be 

1^*7 1 1*^*7 1 -1 

7 = max —— = max ■. .. ^^,.,1 = max :; ; — .. ^^,.,|, < 1 . 

M \cij\ i,j \aij\+ei/\N{i)\ l + {e^)/{\a^j\\N{i)\) 

In total, we get that for a desired accuracy of e||b||oo we need to iterate for t = [log(e)/log(7)] 
rounds. Note that this is an upper bound and in practice we indeed have observed a much faster 
convergence rate. 

The computation of the parameter 7 can be easily done in a distributed manner: Each node 
locally computes £j, and 7j = maxj 1/(1 + \aij\ei/N{i)). Finally, one maximum operation is 
performed globally, 7 = maxj7j. 
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3.2 Convergence Acceleration 

Further speed-up of GaBP can be achieved by adapting known acceleration techniques from linear 
algebra, such Aitken's method and Steffensen's iterations [45]. Consider a sequence {x„} where 
n is the iteration number, and x is the marginal probability computed by GaBP. Further assume 
that {x„} linearly converge to the limit x, and x„ 7^ x for n > 0. According to Aitken's method, 
if there exists a real number a such that \a\ < 1 and lim„^oo(xn — x)/ (x„_i — x) = a, then the 
sequence {yn} defined by 

_ (Xn+l — X„)'^ 

Yn X„ - ■ 

Xn+2 ~ ^X„+i + X^ 

converges to x faster than {x„} in the sense that lim^^oo — y„)/(x — x„)| = 0. Aitken's 
method can be viewed as a generalization of over-relaxation, since one uses values from three, 
rather than two, consecutive iteration rounds. This method can be easily implemented in GaBP 
as every node computes values based only on its own history. 

Steffensen's iterations incorporate Aitken's method. Starting with x„, two iterations are run 
to get x„+i and x„+2- Next, Aitken's method is used to compute y„, this value replaces the 
original x„, and GaBP is executed again to get a new value of x„+i. This process is repeated 
iteratively until convergence. We remark that, although the convergence rate is improved with 
these enhanced algorithms, the region of convergence of the accelerated GaBP solver remains 
unchanged. Chapter 4 gives numerical examples to illustrate the proposed acceleration method 
performance. 

3.3 GaBP Broadcast Variant 

For a dense matrix A each node out of the n nodes sends a unique message to every other node 
on the fully-connected graph. This recipe results in a total of messages per iteration round. 

The computational complexity of the GaBP solver as described in Algorithm 1 for a dense 
linear system, in terms of operations (multiplications and additions) per iteration round, is shown 
in Table 3.1. In this case, the total number of required operations per iteration is 0{n^). This 
number is obtained by evaluating the number of operations required to generate a message 
multiplied by the number of messages. Based on the summation expressions for the propagating 
messages Py and yU^, it is easily seen that it takes 0{n) operations to compute such a message. 
In the dense case, the graph is fully-connected resulting in 0{n'^) propagating messages. 

In order to estimate the total number of operations required for the GaBP algorithm to solve 
the linear system, we have to evaluate the number of iterations required for convergence. It is 
known [46] that the number of iterations required for an iterative solution method is 0{f{K,)), 
where /(k) is a function of the condition number of the data matrix A. Hence the total complexity 
of the GaBP solver can be expressed by 0{n^) x 0{f{K)). The analytical evaluation of the 
convergence rate function /(k) is a challenging open problem. However, it can be upper bounded 
by f{K) < f^- Furthermore, based on our experimental study, described in Section 4, we can 
conclude that /(k) < ^/K,. This is because typically the GaBP algorithm converges faster than 
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Algorithm 


Operations per msg 


msgs 


Total operations per iteration 


Naive GaBP (Algorithm 1) 


0{n) 






Broadcast GaBP (Algorithm 2) 


0{n) 


0{n) 





Table 3.1: Computational complexity of the GaBP solver for dense n x n matrix A. 



the SOR algorithm. An upper bound on the number of iterations of the SOR algorithm is 
y/n. Thus, the total complexity of the GaBP solver in this case is 0{n^) x 0{^/K). For well- 
conditioned (as opposed to ill-conditioned) data matrices the condition number is 0{1). Thus, 
for well-conditioned linear systems the total complexity is 0{n'^), i.e. , the complexity is cubic, 
the same order of magnitude as for direct solution methods, like Gaussian elimination. 

At first sight, this result may be considered disappointing, with no complexity gain w.r.t. 
direct matrix inversion. Luckily, the GaBP implementation as described in Algorithm 1 is a naive 
one, thus termed naive GaBP. In this implementation we did not take into account the correlation 
between the different messages transmitted from a certain node i. These messages, computed 
by summation, are distinct from one another in only two summation terms. 

Instead of sending a message composed of the pair of i^iij and P^, a node can broadcast the 
aggregated sums 

Pi = Pii+ Yl (3-1) 

ifceN(i) 

fii = Pj^^{Piifiii+ ^ Pki^-ki)- (3.2) 
fceN(i) 

Consequently, each node can retrieve locally the Pi\j (2.17) and Hi^j (2.18) from the sums by 
means of a subtraction 

Pi\j Pi Pji ! l^'^) 

/^Ai = l^i- Pi\]PiiHi- (3-4) 

The rest of the algorithm remains the same. On dense graphs, the broadcast version sends 0{n) 
messages per round, instead of 0{n^) messages in the GaBP algorithm. This construction is 
typically useful when implementing the GaBP algorithm in communication networks that support 
broadcast (for example Ethernet and wireless networks), where the messages are sent in broadcast 
anyway. See for example [47,48]. Chapter 8 brings an example of large scale implementation of 
our broadcast variant using 1,024 CPUs. 
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3.4. THE GABP- BASED SOLVER AND CLASSICAL SOLUTION 
GASP ALGORITHM PROPERTIES METHODS 



Algorithm 2. 



1. Initialize. 



2. Iterate: 



3. Check: 



4. Infer: 



5. Solve: 



/ 

/ 

/ 

/ 

/ 
/ 



/ 
/ 

/ 



Set the neighborhood N(i) to include 

VA; z3A. ^ 0. 

Set the scalar fixes 

Pa = An and fiu = hi/An, 

Set the initial i N(i) broadcast messages 

Pj = and /ij = 0. 

Set the initial N(i) 3 k ^ i internal scalars 

Pki = and fiti = 0. 

Set a convergence threshold e. 

Broadcast the aggregated sum messages 

Pi = Pii + XlfceN(i) ^ki, 

(under certain scheduling). 
Compute the N(j) 3 i j internal scalars 

Pij = —/^Ij/ {Pi ~ Pji)' 

fJ'ij {PifJ'i Pjif^ji^/^ij- 

If the internal scalars P^j and fiij did not 
converge (w.r.t. e) , return to Step 2. 
Else, continue to Step 4. 
Compute the marginal means 

l^i = {Piif'ii + X]fcGN(i) Pkil^ki) / {Pii + X]fcGN(i) Pki) = P'i' 

Optionally compute the marginal precisions 

Pi = Pii + X]fceN(i) Pki = -P« ) 

Find the solution 

X* = fii, Vz. 



3.4 The GaBP-Based Solver and Classical Solution 
Methods 

3.4.1 Gaussian Elimination 

Proposition 16. The GaBP-based solver (Algorithm 1) for a system of linear equations repre- 
sented by a tree graph is identical to the renowned Gaussian elimination algorithm (a.k.a. LU 
factorization, [46]). 

Proof. Consider a set of n linear equations with n unknown variables, a unique solution and a 
tree graph representation. We aim at computing the unknown variable associated with the root 
node. Without loss of generality as the tree can be drawn with any of the other nodes being its 
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root. Let us enumerate the nodes in an ascending order from the root to the leaves (see, e.g., 
Fig. 3.1). 




Figure 3.1: Example topology of a tree with 5 nodes 



As in a tree, each child node [i.e. , all nodes but the root) has only one parent node and 
based on the top-down ordering, it can be easily observed that the tree graph's corresponding 
data matrix A must have one and only one non-zero entry in the upper triangular portion of its 
columns. Moreover, for a leaf node this upper triangular entry is the only non-zero off-diagonal 
entry in the whole column. See, for example, the data matrix associated with the tree graph 
depicted in Fig 3.1 



/ All 


Ai2 


Ai3 








\ 


Al2 


A22 





A24 


A25 




^13 





A33 


















A44 







I 


^25 








A55 


/ 



(3.5) 



where the non-zero upper triangular entries are in bold and among these the entries corresponding 
to leaves are underlined. 

Now, according to GE we would like to lower triangulate the matrix A. This is done by 
eliminating these entries from the leaves to the root. Let / be a leaf node, i be its parent and j 
be its parent (/'th node grandparent). Then, the /'th row is multiplied by —An/ An and added to 
the z'th row. in this way the An entry is being eliminated. However, this elimination, transforms 
the z'th diagonal entry to be An An — AfJAu, or for multiple leaves connected to the same 
parent An An - E«GN(i)j ^hM«/- In our example. 



/ An 


A12 











\ 


A12 


A22 - AIJA33 - AlJA^i - Al^/A^^ 













An 





A33 













^24 





A44 







V 


^25 








A55 


/ 



(3.6) 



Thus, in a similar manner, eliminating the parent i yields the multiplication of the j'th diagonal 
term by —A^MAn — X]i6N(i)i ^^iM")- Recalling that Pn = An, we see that the last expression 
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is identical to the update rule of Pj, in GaBP. Again, in our example 





B 





















c 















An 





^33 




















Au 







v 





^25 








A55 


I 



(3.7) 



C = A 



22 



AIJA 



33 



/A 



44 



where 5 = An - ^22/(^22 - Afg/Agg - AlJA^^ - AlJA^^) 
A^^/A^^. Now the matrix is fully lower triangulated. To put differently in terms of GaBP, the Pij 
messages are subtracted from the diagonal Pa terms to triangulate the data matrix of the tree. 
Performing the same row operations on the right hand side column vector b, it can be easily seen 
that we equivalently get that the outcome of the row operations is identical to the GaBP solver's 
fXij update rule. These updadtes/row operations can be repeated, in the general case, until the 
matrix is lower triangulated. 

Now, in order to compute the value of the unknown variable associated with the root node, all 
we have to do is divide the first diagonal term by the transformed value of fei, which is identical 
to the infer stage in the GaBP solver (note that by definition all the nodes connected to the root 
are its children, as it does not have parent node). In the example 



All - A?2/(A 



22 



AlJAs3 - A24/A44 - AIJA55) 



(3.8) 



bn - Ai2/(622 - A13/A33 - A24/A44 - A25/A55) 

Note that the rows corresponding to leaves remain unchanged. 

To conclude, in the tree graph case, the 'iterative' stage (stage 2 on algorithm 1) of the 
GaBP solver actually performs lower triangulation of the matrix, while the 'infer' stage (stage 4) 
reducers to what is known as forward substitution. Evidently, using an opposite ordering, one can 
get the complementary upper triangulation and back substitution, respectively. □ 

It is important to note, that based on this proposition, the GaBP solver can be viewed as 
GE run over an unwrapped version {i.e. , a computation tree as defined in [8]) of a general loopy 
graph. 



3.4.2 Iterative Methods 

Iterative methods that can be expressed in the simple form 

xW = Bx(*-i) + c, (3.9) 

where neither the iteration matrix B nor the vector c depend upon the iteration number t, are 
called stationary iterative methods. In the following, we discuss three main stationary iterative 
methods: the Jacobi method, the Gauss-Seidel (GS) method and the successive overrelaxation 
(SOR) method. The GaBP-based solver, in the general case, can not be written in this form, 
thus can not be categorized as a stationary iterative method. 

Proposition 17. [46] Assuming I — B /s invertible, then the iteration 3.9 converges (for any 
initial guess, x.^^^). 
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3.4.3 Jacob! Method 

The Jacobi method (Gauss, 1823, and Jacobi 1845, [2]), a.k.a. the simultaneous iteration method, 
is the oldest iterative method for solving a square linear system of equations Ax = b. The method 
assumes that Vj An ^ 0. It's complexity is 0{n^) per iteration. There are two know sufficient 
convergence conditions for the Jacobi method. The first condition holds when the matrix A is 
strictly or irreducibly diagonally dominant. The second condition holds when p(D^^(L+U)) < 1. 
Where D = diag(A), L, U are upper and lower triangular matrices of A. 

Proposition 18. The GaBP-based solver (Algorithm 1) 

1. with inverse variance messages set to zero, i.e. , Pij = 0, z G N{j),\fj; 

2. incorporating the message received from node j when computing the message to be sent 
from node i to node j, i.e. replacing k E N{i)\j with k E N(i), 

is identical to the Jacobi iterative method. 

Proof. Setting the precisions to zero, we get in correspondence to the above derivation, 

-^Ai ~ ~ (3.10) 

Pijl^ij -Aij , (3.11) 

(Xi = A^^{bi- ^ AkiUkXi)- (3-12) 

A:eN(j) 

Note that the inverse relation between Pij and Pi\j (2.25) is no longer valid in this case. 
Now, we rewrite the mean (2.18) without excluding the information from node j, 

l^i\j = ^u\bi- ^ AkiHk\i)- (3.13) 

fcGN(i) 

Note that = jji, hence the inferred marginal mean pi (3.12) can be rewritten as 

^ii = ATl'ibi-Y.^kif^k), (3.14) 

where the expression for all neighbors of node i is replaced by the redundant, yet identical, 
expression k ^ i. This fixed-point iteration is identical to the renowned Jacobi method, concluding 
the proof. □ 

The fact that Jacobi iterations can be obtained as a special case of the GaBP solver further 
indicates the richness of the proposed algorithm. 



24 



Chapter 4 

Numerical Examples 



In this chapter we report experimental study of three numerical examples: toy linear system, 2D 
Poisson equation and symmetric non-PSD matrix. In all examples, but the Poisson's equation 4.8, 
b is assumed to be an m-length all-ones observation vector. For fairness in comparison, the initial 
guess in all experiments, for the various solution methods under investigation, is taken to be the 
same and is arbitrarily set to be equal to the value of the vector b. The stopping criterion in all 
experiments determines that for all propagating messages (in the context the GaBP solver) or 
all n tentative solutions (in the context of the compared iterative methods) the absolute value 
of the difference should be less than e < 10~^. As for terminology, in the following performing 
GaBP with parallel (flooding or synchronous) message scheduling is termed 'parallel GaBP', while 
GaBP with serial (sequential or asynchronous) message scheduling is termed 'serial GaBP'. 



4.1 Numerical Example: Toy Linear System: 3x3 Equa- 
tions 

Consider the following 3x3 linear system 

(4.1) 



A 


= 1 


A 

^xy 


= -2 


A 

■'^xz 


= 3 
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^yx 


= -2 
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= 1 


Ayz 
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A-zx 


= 3 


A 

^zy 
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A-zz 


= 1 











)l 
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\-\ 






{ z ) 







A X b 



We would like to find the solution to this system, x* = {x*, y*, 2;*}^. Inverting the data matrix 
A, we directly solve 

-1/12 -1/6 1/4 

-1/6 2/3 1/2 I I I = I 2 I . (4.2) 

1/4 1/2 1/4 
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Alternatively, we can now run the GaBP solver. Fig. 4.1 displays the graph, corresponding to 
the data matrix A, and the message-passing flow. As Ay^ = A^y = 0, this graph is a cycle-free 
tree, thus GaBP is guaranteed to converge in a finite number of rounds. As demonstrated in the 
following, in this example GaBP converges only in two rounds, which equals the tree's diameter. 
Each propagating message, rriij, is described by two scalars fiij and Pij, standing for the mean 
and precision of this distribution. The evolution of the propagating means and precisions, until 
convergence, is described in Table 4.1, where the notation t = 0,1,2,3 denotes the iteration 
rounds. Converged values are written in bold. 
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6 


6 
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l-^xz 


i^PxxfJ'xx ~l~ Pyxf^yx) /Axz 
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-2 
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l^zx 


Pzzl^zz / -^zx 





2/3 


2/3 


2/3 



Table 4.1: Evolution of means and precisions on a tree with three nodes 

Next, following the GaBP solver algorithm, we infer the marginal means. For exposition 
purposes we also present in Table 4.2 the tentative solutions at each iteration round. 



Solution 


Computation 


t=0 


t=l 


t=2 


t=3 


l^x 


i^PxxfJ'xx ~l~ Pzxl^zx ~l~ Pyx f^yx) / i^Pxx ~l~ Pzx ~l~ Pyx) 
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1 


1 


1 


f^y 


[PyyfJ-yy + Pxyl^xy) / {Pyy + Pxy) 





4 


2 


2 


fJ'Z 


i^Pzzf^zz ~l~ Pxzl^xz) / i^Pzz ~l~ Pxz) 


2 


-5/2 


-1 


-1 



Table 4.2: Tentative means computed on each iteration until convergence 

Thus, as expected, the GaBP solution x* = {x* = l,y* = 2,z* = —1}^ is identical to what 
is found using the direct approach. Note that as the linear system is described by a tree graph, 
then for this particular case, the inferred precision is also exact 

Px Pxx ~l~ Pyx ~l~ Pzx 12, (4.3) 

Py ~ Pyy ~^ P^y ~ 3/2, (^■^) 

Pz = Pzz + Pxz = ^. (4.5) 

(4.6) 

and gives {P~' = {A-'}xx = -1/12, P;' = {A'^jyy = 2/3, = {A-^},, = 1/4}^, /.e.the 
true diagonal values of the data matrix's inverse, A^^. 
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4.2. NON PSD EXAMPLE 




Figure 4.1: A tree topology with three nodes 

4.2 Numerical Example: Symmetric Non-PSD Data Ma- 
trix 

Consider the case of a linear system with a symmetric, but non-PSD data matrix 

(4.7) 




Table 4.3 displays the number of iterations required for convergence for the iterative methods 
under consideration. The classical methods diverge, even when aided with acceleration techniques. 
This behavior (at least without the acceleration) is not surprising in light of Theorem 15. Again 
we observe that serial scheduling of the GaBP solver is superior parallel scheduling and that 
applying Steffensen iterations reduces the number of iterations in 45% in both cases. Note that 
SOR cannot be defined when the matrix is not PSD. By definition CG works only for symmetric 
PSD matrices. Because the solution is a saddle point and not a minimum or maximum. 



4.3 Application Example: 2-D Poisson's Equation 

One of the most common partial differential equations (PDEs) encountered in various areas 
of exact sciences and engineering (e.g. , heat flow, electrostatics, gravity, fluid flow, quantum 
mechanics, elasticity) is Poisson's equation. In two dimensions, the equation is 

Au{x,y) = f{x,y), (4.8) 

for {x, y} E ^l, where 
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Algorithm 


Iterations t 


Jacobi,GS,SR,Jacobi+Aitkens,Jacobi+Steffensen 


— 


Parallel GaBP 


38 


Serial GaBP 


25 


Parallel GaBP+Steffensen 


21 


Serial GaBP+StefFensen 


14 



Table 4.3: Symmetric non-PSD 3x3 data matrix. Total number of iterations required for 
convergence (threshold e = 10^^) for GaBP-based solvers vs. standard methods. 




2 4 6 8 10 

Iteration t 



Figure 4.2: Convergence rate for a 3 x 3 symmetric non-PSD data matrix. The Frobenius norm 
of the residual per equation, 1 1 Ax* — b\ {p/n, as a function of the iteration t for GS (triangles and 
solid line), Jacobi (squares and solid line), SR (stars and solid line), parallel GaBP (circles and 
solid line) and serial GaBP (circles and dashed line) solvers. 



is the Laplacian operator and is a bounded domain in M?. The solution is well defined only under 
boundary conditions, i.e. , the value of u{x, y) on the boundary of Vt is specified. We consider the 
simple (Dirichlet) case of u(x,y) = for {x,y} on the boundary of fi. This equation describes, 
for instance, the steady-state temperature of a uniform square plate with the boundaries held at 
temperature u = 0, and f{x,y) equaling the external heat supplied at point {x,y}. 

The poisson's PDF can be discretized by using finite differences. An p + 1 xp+1 square grid 
on with size (arbitrarily) set to unity is used, where h = 1) is the grid spacing. We let 
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Figure 4.3: The left graph depicts accelerated convergence rate for a 3 x 3 symmetric non-PSD 
data matrix. The Frobenius norm of the residual per equation, ||Ax* — hWr/n, as a function of 
the iteration t for Aitkens (squares and solid line) and Steffensen-accelerated (triangles and solid 
line) Jacobi method, parallel GaBP (circles and solid line) and serial GaBP (circles and dashed 
line) solvers accelerated by Steffensen iterations. The right graph shows a visualization of parallel 
GaBP on the same problem, drawn in M'^. 




10 20 30 40 50 60 70 80 90 100 
Index j 



Figure 4.4: Image of the corresponding sparse data matrix for the 2-D discrete Poisson's PDE 
with p = 10. Empty (full) squares denote non-zero (zero) entries. 



U{i,j), {i,j = 0, . . . ,p + 1}, be the approximate solution to the PDE at x = ih and y = jh. 
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Approximating the Laplacian by 



AU{x,y) 



U{i + l,])-2U{i,j) + U{i-l,j) , f/(^,J + l)-2f/(^,J) + ^(^,J-l) 



/i2 

one gets the system n = linear equations with n unknowns 

4^7(2, j) - U{i -1,3)- U{i + 1, j) - U{i,3 - 1) - U{i,3 + 1) = K^, j)V^, J 



(4.10) 



where = —f{ih,jh)h'^, the scaled value of the function f{x,y) at the corresponding grid 

point {i,j}. Evidently, the accuracy of this approximation to the PDE increases with n. 

Choosing a certain ordering of the unknowns U{i,j), the linear system can be written in 
a matrix-vector form. For example, the natural row ordering {i.e., enumerating the grid points 
left— s>right, bottom^up) leads to a linear system with xp^ sparse data matrix A. For example, 
a Poisson PDE with p = 3 generates the following 9x9 linear system 
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(4.11) 



X 



b 



where blank data matrix A entries denote zeros. 

Hence, now we can solve the discretized 2-D Poisson's PDE by utilizing the GaBP algorithm. 
Note that, in contrast to the other examples, in this case the GaBP solver is applied for solving 
a sparse, rather than dense, system of linear equations. 

In order to evaluate the performance of the GaBP solver, we choose to solve the 2-D Poisson's 
equation with discretization of p = 10. The structure of the corresponding 100 x 100 sparse data 
matrix is illustrated in Fig. 4.4. 
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Algorithm 


Iterations t 


Jacobi 


354 


GS 


136 


Optimal SOR 


37 


Parallel GaBP 


134 


Serial GaBP 


73 


Parallel GaBP+Aitkens 


25 


Parallel GaBP+Steffensen 


56 


Serial GaBP+Steffensen 


32 



Table 4.4: 2-D discrete Poisson's PDE with p = 3 and f{x,y) = —1. Total number of iterations 
required for convergence (threshold e = 10~^) for GaBP-based solvers vs. standard methods. 




5 10 15 20 25 

Iteration t 



Figure 4.5: Accelerated convergence rate for the 2-D discrete Poisson's PDE with p = 10 and 
f{x,y) = —l. The Frobenius norm of the residual, per equation, 1 1 Ax* — 6| as a function 
of the iteration t for parallel GaBP solver accelrated by Aitkens method (x-marks and solid line) 
and serial GaBP solver accelerated by Steffensen iterations (left triangles and dashed line). 
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Algorithm 


Iterations t 


Jacobi,GS,SR,Jacobi+Aitkens,Jacobi+Steffensen 




Parallel GaBP 


84 


Serial GaBP 


30 


Parallel GaBP+Steffensen 


43 


Serial GaBP+Steffensen 


17 



Table 4.5: Asymmetric 3x3 data matrix, total number of iterations required for convergence 
(threshold e = 10^^) for GaBP-based solvers vs. standard methods. 




Iteration t Iteration t 



Figure 4.6: Convergence of an asymmetric 3x3 matrix. 
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Figure 4.7: Convergence of a 3 x 3 asymmetric matrix, using 3D plot. 
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Chapter 5 

Fixing convergence of GaBP 



In this chapter, we present a novel construction that fixes the convergence of the GaBP algorithm, 
for any Gaussian model with positive-definite information matrix (inverse covariance matrix), 
even when the currently known sufficient convergence conditions do not hold. We prove that our 
construction converges to the correct solution. Furthermore, we consider how this method may 
be used to solve for the least-squares solution of general linear systems. We defer experimental 
results evaluating the efficiency of the convergence fix to Chapter 7.2 in the context of linear 
detection. By using our convergence fix construction we are able to show convergence in practical 
CDMA settings, where the original GaBP algorithm did not converge, supporting a significantly 
higher number of users on each cell. 

5.1 Problem Setting 

We wish to compute the maximum a posteriori (MAP) estimate of a random vector x with 
Gaussian distribution (after conditioning on measurements): 

p{x) oc exp{ — ix^Jx -I- h^x} , (5.1) 

where J >- is a symmetric positive definite matrix (the information matrix) and h is the potential 
vector. This problem is equivalent to solving Jx = h for x given (h, J) or to solve the convex 
quadratic optimization problem: 

minimize f{x) = |x^Jx — h^x. (5-2) 

We may assume without loss of generality (by rescaling variables) that J is normalized to have 
unit-diagonal, that is, J = I — R with R having zeros along its diagonal. The off-diagonal entries 
of R then correspond to partial correlation coefficients [49]. Thus, the fill pattern of R (and J) 
reflects the Markov structure of the Gaussian distribution. That is, p(x) is Markov with respect 
to the graph with edges Q = {{i,j)\rij ^ 0} . 

If the model J = I — R is walk-summable [9,10], such that the spectral radius of \R\ = {\rij\) 
is less than one {p{\R\) < 1), then the method of GaBP may be used to solve this problem. 
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5.2. DIAGONAL LOADING 



We note that the walk-summable condition implies I — R \s positive definite. An equivalent 
characterization of the walk-summable condition is that /— \R\ is positive definite. 

This chapter presents a method to solve non-walksummable models, where J = I — R '\s 
positive definite but p{\R\) > 1, using GaBP. There are two key ideas: (1) using diagonal loading 
to create a perturbed model J' = J + T which is walk-summable (such that the GaBP may be 
used to solve J'x = h for any h) and (2) using this perturbed model J' and convergent GaBP 
algorithm as a preconditioner in a simple iterative method to solve the original non-walksummable 
model. 

5.2 Diagonal Loading 

We may always obtain a walk-summable model by diagonal loading. This is useful as we can then 
solve a related system of equations efficiently using Gaussian belief propagation. For example, 
given a non-walk-summable model J = / — i? we obtain a related walk-summable model = 
J + that is walk-summable for large enough values of 7: 

Lemma 1. Let J = I - R and J' = J 7/ = (1 7)/ - R. Let 7 > 7* where 

l*=pi\R\)-l. (5.3) 

Then, J' is walk-summable and GaBP based on J' converges. 

Proof. We normalize J' = (1 + 7)/ -Rto obtain J^^^^ = I - R' with R' = {1 + -lY^R, which 
is walk-summable if and only if p(|-R'|) < 1. Using p(|-R'|) = (1 -|- 7)"^p(|i?|) we obtain the 
condition (1 -I- 7)"^p(|i?|) < 1, which is equivalent to 7 > p{\R\) — 1. □ 

It is also possible to achieve the same effect by adding a general diagonal matrix V to obtain 
a walk-summable model. For example, for all V > T*, where 7*^ = Ja — J^jj^i I'^ul' holds that 
J+T is diagonally-dominant and hence walk-summable (see [10]). More generally, we could allow 
r to be any symmetric positive-definite matrix satisfying the condition / -|- F :^ \R\. However, 
only the case of diagonal matrices is explored in this chapter. 

5.3 Iterative Correction Method 

Now we may use the diagonally-loaded model J' = J + T to solve Jx = h for any value of F > 0. 
The basic idea here is to use the diagonally-loaded matrix J' = J -|- F as a preconditioner for 
solving the Jx = h using the iterative method: 

= (J + F)"i(/i-fFx(*)). (5.4) 

Note that the effect of adding positive F is to reduce the size of the scaling factor (J -|- F)~^ 
but we compensate for this damping effect by adding a feedback term Fx to the input h. Each 
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step of this iterative method may also be interpreted as solving the following convex quadratic 
optimization problem based on the objective f(x) from (5.2): 

= argmin {/(x) + Ux - x^'Y^i^ - a^^*^)} • (5-5) 

This is basically a regularized version of Newton's method to minimize f{x), where we regularize 
the step-size at each iteration. Typically, this regularization is used to ensure positive-definiteness 
of the Hessian matrix when Newton's method is used to optimize a non-convex function. In 
contrast, we use it to ensure that J + F is walk-summable, so that the update step can be 
computed via Gaussian belief propagation. Intuitively, this will always move us closer to the 
correct solution, but slowly if T is large. It is simple to demonstrate the following: 

Lemma 2. Let J y and T ^ 0. Then, x^*) defined by (5.4) converges to x* = J^^h for all 
initializations x^^^ . 

Comment. The proof is given for a general (non-diagonal) F ^ 0. For diagonal matrices, this 
is equivalent to requiring Fjj > for i = 1, . . . , n. 

Proof First, we note that there is only one possible fixed-point of the algorithm and this is 
X* = J^^h. Suppose X is a fixed point: x = {J + T)~^(h + Fx). Hence, (J -I- F)x = h + Tx 
and Jx = h. For non-singular J, we must then have x = J~^h. Next, we show that the method 
converges. Let e'-*-' = x*-*^ — x* denote the error of the k-th estimate. The error dynamics are then 
= (J + r)^iFeW. Thus, = ((J + F)-iF)'=e(°) and the error converges to zero if and 
only if p((J + F)-iF) < 1, or equivalently p{H) < 1, where H = {J + Ty^/^T{J + T)-^/^ ^ is 
a symmetric positive semi-definite matrix. Thus, the eigenvalues of H are non-negative and we 
must show that they are less than one. It is simple to check that if A is an eigenvalue of H then 
is an eigenvalue of F^/^ J~^F^/^ >z 0. This is seen as follows: Hx = Ax, (J -|- T)~^Ty = Xy 
[y = (J+ry^/^x), Ty = X{J+T)y, {l-X)Ty = XJy, J~^Ty = -^y and V^I'^J-^V^I^z = -^z 
(z = F^/^y) [note that A 7^ 1, otherwise Jy = contradicting J y 0]. Therefore > and 
< A < 1. Then p{H) < 1, e^*) and x^*) x* completing the proof. □ 

Now, provided we also require that J' = J -|- F is walk-summable, we may compute x*^*'^^^ = 
(J -|- F)~^/i*^*+^\ where = h + Tx^^\ by performing Gaussian belief propagation to solve 

_ /^(t+i) Thus, we obtain a double-loop method to solve Jx = h. The inner-loop 
performs GaBP and the outer-loop computes the next h^^K The overall procedure converges 
provided the number of iterations of GaBP in the inner-loop is made large enough to ensure a good 
solution to J'x^*"*"^^ = Alternatively, we may compress this double-loop procedure into a 

single-loop procedure by preforming just one iteration of GaBP message-passing per iteration of 
the outer loop. Then it may become necessary to use the following damped update of h^^'^ with 
step size parameter s E (0, 1): 

= (l-s)/iW + s(/i + FxW) 

= /i + F((l - s)x(*^i) + sx^*)) , (5.6) 
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This single-loop method converges for sufficiently small values of s. In practice, we have found 
good convergence with s = |. This single-loop method can be more efficient than the double-loop 
method. 



5.4 Extension to General Linear Systems 

in this section, we efficiently extend the applicability of the proposed double-loop construction for 
a general linear system of equations (possibly over-constrained.) Given a full column rank matrix 
J E M"^'^, n > k, and a shift vector h, we are interested in solving the least squares problem 
miria; II Jx — h\\2. The naive approach for using GaBP would be to take the information matrix 
J = (J^J), and the shift vector h = J^h. Note that J is positive definite and we can use GaBP 
to solve it. The MAP solution is 

x = r^h= {J^jy^J^h, (5.7) 

which is the pseudo-inverse solution. 

Note, that the above construction has two drawbacks: first, we need to explicitly compute J 
and li, and second, J may not be sparse in case the original matrix J |s sparse. To overcome 
this problem, following [19], we construct a new symmetric data matrix J based on the arbitrary 
rectangular matrix J G R"^^ 

J A I "^kxh \ ^ j]^(fe+n)x(fc+n) 



J Onxn 

Additionally, we define a new hidden variable vector x = {x'^ G ]R('^+'^\ where x G M*^ 
is the solution vector and z G M" is an auxiliary hidden vector, and a new shift vector h = 

{0Li,h^reM('=+"). 

Lemma 3. Solving x = J^^li and taking the first k entries is identical to solving Eq. 10.17. 
Proof. Is given in [19]. 

For applying our double-loop construction on the new system {h, J) to obtain the solution to 
Eq. (10.17), we need to confirm that the matrix J is positive definite. (See lemma 2). To this 
end, we add a diagonal weighting —7/ to the lower right block: 



J -7/ 

Then we rescale J to make it unit diagonal (to deal with the negative sign of the lower right 
block we use a complex Gaussian notation as done in [50]). It is clear that for a large enough 
7 we are left with a walk-summable model, where the rescaled J is a hermitian positive definite 
matrix and p(| J — /|) < 1. Now it is possible to use the double-loop technique to compute Eq. 
10.17. Note that adding —7/ to the lower right block of J is equivalent to adding 7/ into Eq. 
7: 

X = {.Fj + -fI)-\Fh (5.8) 
where 7 can be interpreted as a regularization parameter. 
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Chapter 6 



Rating Users and Data Items in Social 
Networks 

We propose to utilize the distributed GaBP solver presented in Chapter 2 to efficiently and 
distributively compute a solution to a family of quadratic cost functions described below. By 
implementing our algorithm once, and choosing the computed cost function dynamically on the 
run we allow a high flexibility in the selection of the rating method deployed in the Peer-to-Peer 
network. 

We propose a unifying family of quadratic cost functions to be used in Peer-to-Peer ratings. 
We show that our approach is general since it captures many of the existing algorithms in the fields 
of visual layout, collaborative filtering and Peer-to-Peer rating, among them Koren spectral layout 
algorithm, Katz method. Spatial ranking. Personalized PageRank and Information Centrality. 
Beside of the theoretical interest in finding common basis of algorithms that were not linked 
before, we allow a single efficient implementation for computing those various rating methods. 

Using simulations over real social network topologies obtained from various sources, including 
the MSN Messenger social network, we demonstrate the applicability of our approach. We report 
simulation results using networks of millions of nodes. 

Whether you are browsing for a hotel, searching the web, or looking for a recommendation 
on a local doctor, what your friends like will bear great significance for you. This vision of 
virtual social communities underlies the stellar success of a growing body of recent web ser- 
vices, e.g., http://www.flickr.com, http://del.icio.us, http://www.myspace.com, and 
others. However, while all of these services are centralized, the full flexibility of social information- 
sharing can often be better realized through direct sharing between peers. 

This chapter presents a mechanism for sharing user ratings (e.g., on movies, doctors, and 
vendors) in a social network. It introduces distributed mechanisms for computing by the network 
itself individual ratings that incorporate rating-information from the network. Our approach 
utilizes message-passing algorithms from the domain of Gaussian graphical models. In our system, 
information remains in the network, and is never "shipped" to a centralized server for any global 
computation. Our algorithms provide each user in the network with an individualized rating 
per object (e.g., per vendor). The end-result is a local rating per user which minimizes her cost 
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function from her own rating (if exists) and, at the same time, benefits from incorporating ratings 
from her network vicinity. Our rating converges quickly to an approximate optimal value even in 
sizable networks. 

Sharing views over a social network has many advantages. By taking a peer-to-peer approach 
the user information is shared and stored only within its community. Thus, there is no need for 
trusted centralized authority, which could be biased by economic and/or political forces. Likewise, 
a user can constrain information pulling from its trusted social circle. This prevents spammers 
from penetrating the system with false recommendations. 

6.1 Our General Framework 

The social network is represented by a directed, weighted communication graph G = {V,E). 
Nodes V = {l,2,...,n} are users. Edges express social ties, where a non-negative edge weight 
Wij indicates a measure of the mutual trust between the endpoint nodes i and j. Our goal is 
to compute an output rating x G i?" to each data item (or node) where Xi is the output rating 
computed locally in node i. The vector x is a solution that minimizes some cost function. Next, 
we propose such a cost function, and show that many of the existing rating algorithms conform 
to our proposed cost function. 

We consider a single instance of the rating problem that concerns an individual item (e.g., 
a movie or a user). In practice, the system maintains ratings of many objects concurrently, but 
presently, we do not discuss any correlations across items. A straight forward generalization to 
our method for collaborative filtering, to rank m (possibly correlated) items, as done in [51]. 

In this chapter, we methodically derive the following quadratic cost function, that quantifies 
the Peer-to-Peer rating problem: 



where Xi is a desired rating computed locally by node i, y is an optional prior rating, where yi is 
a local input to node i (in case there is no prior information, y is a vector of zeros). 

We demonstrate the generality of the above cost function by proving that many of the existing 
algorithms for visual layouts, collaborative filtering and ranking of nodes in Peer-to-Peer networks 
are special cases of our general framework: 

1. Eigen-Projection method. Setting = l,Wii = l,Wij = 1 we get the Eigen-Projection 
method [52] in a single dimension, an algorithm for ranking network nodes for creating a 
intuitive visual layout. 

2. Koren's spectral layout method. Recently, Dell'Amico proposed to use Koren's visual 
layout algorithm for ranking nodes in a social network [53]. We will show that this ranking 
method is a special case of our cost function, setting: wa = deg{i). 




(6.1) 
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3. Average computation. By setting the prior inputs yi to be the input of node i and taking 
/5 oo we get Xi = l/nJ^iVi' the average value of y. This construction, Consensus 
Propagation, was proposed by Moallemi and Van-Roy in [54]. 

4. Peer-to-Peer Rating. By generalizing the Consensus Propagation algorithm above and 
supporting weighted graph edges, and finite /? values we derive a new algorithm for Peer- 
to-Peer rating [20]. 

5. Spatial Ranking. We propose a generalization of the Katz method [55], for computing a 
personalized ranking of peers in a social network. By setting i/i = l,Wii = 1 and regarding 
the weights Wij as the probability of following a link of an absorbing Markov-chain, we 
formulate the spatial ranking method [20] based on the work of [56]. 

6. Personalized PageRank. We show how the PageRank and personalized PageRank algo- 
rithms fits within our framework [57,58]. 

7. Information Centrality. In the information centrality node ranking method [59], the non- 
negative weighted graph G = (V, E) is considered as an electrical network, where edge 
weights is taken to be the electrical conductance. We show that this measure can be 
modelled using our cost function as well. 

Furthermore, we propose to utilize the Gaussian Belief Propagation algorithm (GaBP) - an 
algorithm from the probabilistic graphical models domain - for efficient and distributed compu- 
tation of a solution minimizing a single cost function drawn from our family of quadratic cost 
functions. We explicitly show the relation between the algorithm to our proposed cost function 
by deriving it from the cost function. 

Empirically, our algorithm demonstrates faster convergence than the compared algorithms, 
including conjugate gradient algorithms that were proposed in [53,51] to be used in Peer-to-Peer 
environments. For comparative study of those methods see [7]. 

6.2 Unifying Family of Quadratic Cost Functions 

We derive a family of unifying cost functions following the intuitive explanation of Koren [37]. 
His work addresses graphic visualization of networks using spectral graph properties. Recently, 
Dell'Amico proposed in [53] to use Koren's visual layout algorithm for computing a distance 
metric that enables ranking nodes in a social network. We further extend this approach by 
finding a common base to many of the ranking algorithms that are used in graph visualization, 
collaborative filtering and in Peer-to-Peer rating. Beside of the theoretical interest in finding a 
unifying framework to algorithms that were not related before, we enable also a single efficient 
distributed implementation that takes the specific cost function as input and solves it. 

Given a directed graph G = {V, E) we would like to find an output rating x G -R" to each 
item where Xj is the output rating computed locally in node i. x can be expressed as the solution 
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for the following constraint minimization problem: 

minE(x) = ^ Wij{xi - Xjf, (6.2) 

i,j&E 

Given 

Var(x) = 1, Mean(x) = 0. 

where Var(x) = ^ - Mean(x))2 , Mean(x) = ^^iXi. 

The cost function E{x) is combined of weighted sums of interactions between neighbors. 
From the one hand, "heavy" edges Wij force nodes to have a similar output rating. From the 
other hand, the variance condition prevents a trivial solution of all converging to a single 
value. In other words, we would like the rating of an item to be scaled. The value of the variance 
determines the scale of computed ratings, and is arbitrarily set to one. Since the problem is 
invariant under translation, we add also the mean constraint to force a unique solution. The 
mean is arbitrarily set to zero. 

In this chapter, we address visual layout computed for a single dimension. The work of Koren 
and Dell'Amico is more general than ours since it discusses rating in k dimensions. A possible 
future extension to this work is to extend our work to k dimensions. 

From the visual layout perspective, "stronger" edges Wij let neighboring nodes appear closer 
in the layout. The variance condition forces a scaling on the drawing. 

From the statistical mechanics perspective, the cost function E{x) is considered as the system 
energy that we would like to minimize, the weighted sums are called "attractive forces" and the 
variance constraint is a "repulsive force". 

One of the most important observations we make is that using Koren's framework, the chosen 
values of the variance and mean are arbitrary and could be changed. This is because the variance 
influences the scaling of the layout and the mean the translation of the result. Without the loss of 
generality, in some of the proofs we change the values of the mean and variance to reflect easier 
mathematical derivation. However, normalization of rated values can be always done at the end, 
if needed. Following, we generalize Koren's method to support a larger variety of cost functions. 
The main observation we have, is that the variance condition is used only for scaling the rating, 
without relating to the specific problem at hand. We propose to add a second constraint which 
is local to each node: 

'^Wiiixi - Vif + Wij{xi - Xjf. (6.3) 

Thus we allow higher flexibility, since we allow yi can be regarded as prior information in the case 
of Peer-to-Peer rating, where each node has some initial input it adds to the computation. In 
other cases, yi can be some function computed on x like yi = Yl!i=i^i °'' ^ function computed 
on the graph: y^ = Y.N{^^)■ 
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Theorem 19. The Eigen-Projection method is an instance of the cost function 6.1 when Wij = 

l,Wii = 1,(3 = 1/2, Hi = 1. 

Proof It is shown in [52] that the optimal solution to the cost function is x = L~^l where L is 
the graph Laplacian (as defined in 8). Substitute P = 1/2, wu = l,Wij = = 1 in the cost 
function (6.1) : 

mmE{x) =^l(xi- 1)^ + 1/2 '^{x.-Xj)^. 

The same cost function in linear algebra form (1 is a vector of all ones): 

min -Efx) = x"^Lx — 2x1 + n . 



Now we calculate the derivative and compare to zero and get 



VxEM = 2x^L - 21 



x = 



□ 



Theorem 20. Koren's spectral layout algorithm/Del' Amicco method in a single dimension, is an 
instance of the cost function 6.1 when wu = 1,P = l,yi = deg{i), where deg{i) = J2j(^N{i) ^ij' 
up to a translation. 

Proof Using the notations of [53] the cost function of Koren's spectral layout algorithm is: 

min ^ Wij{xi -Xjf, 
i,jeE 

s.t. ^ deg{i)x1 = n ^ ^ deg{i)xi = 0. 

i i 

We compute the weighted cost function 

>C(x,/5,7) = ^Wij{xi - Xj)'^ - [3C^deg{i)x1 - n) -'y^degi 



l]Xi . 



Substitute the weights /5 = 1,7 = 1/2 we get: 

= ^'^iji^i - ^jf - ^deg{i){xi - if, 

ij i 

Reverting to our cost function formulation we get: 

= ^deg{i){xi - if + '^Wij{xi - xjf. 

i ij 

In other words, we substitute wu = deg{i),yi = 1, P = 1 and we get Koren's formulation. □ 

It is interesting to note, that the optimal solution according to Koren's work is Xi = J2j(^N{i) deg{i) 
which is equivalent to the thin plate model image processing and PDEs [60]. 
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6.2.1 Peer-to-Peer Rating 

In [20] we have proposed to generalize the Consensus Propagation (CP) algorithm [54] to solve 
the general cost function (6.1). 

The CP algorithm is a distributed algorithm for calculating the network average, assuming 
each node has an initial value. We have extended the CP algorithm in several ways. First, in 
the original paper the authors propose to use a very large P for calculating the network average. 
As mentioned, large /3 forces all the nodes to converge to the same value. We remove this 
restriction by allowing a flexible selection of P based on the application needs. As /5 becomes 
small the calculation is done in a closer and closer vicinity of the node. 

Second, we have extended the CP algorithm to support null value, adapting it to omit the 
term — Xj)^ when yi = ±, i.e., when node i has no initial rating. This construction is reported 
in [20] and not repeated here. 

Third, we use non-uniform edge weights Wij, which in our settings represent mutual trust 
among nodes. This makes the rating local to certain vicinities, since we believe there is no 
meaning for getting a global rating in a very large network. This extension allows also asymmetric 
links where the trust assigned between neighbors is not symmetric. In that case we get an 
approximation to the original problem. 

Fourth, we introduce node weights, where node with higher weight has an increasing linear 
contribution to the output of the computation. 

The Peer-to-Peer rating algorithm was reported in detail in [20]. 

Theorem 21. The Consensus propagation algorithm is an instance of our cost function 6.1 with 

Wii = ^ oo. 

The proof is given in Chapter 11.3 
Theorem 22. The Peer-to-Peer rating algorithm is an instance of our cost function 6.1. 
The proof is given in [20]. 

6.2.2 Spatial Ranking 

In [20] we presented a new ranking method called Spatial Ranking, based on the work of Jason 
K. Johnson et al. [56]. Recently, we found out that a related method was proposed in 1953 
in the social sciences field by Leo Katz [55]. The Spatial Ranking method described below is a 
generalization of the Katz method, since it allows weighted trust values between social network 
nodes (unlike the Katz method which deals with binary relations). Furthermore, we propose a 
new efficient implementation for a distributed computation of the Spatial Ranking method. 

In our method, each node ranks itself the list of all other nodes based on its network topology 
and creates a personalized ranking of its own. 

We propose to model the network as a Markov chain with a transition matrix R, and calculate 
the fundamental matrix P, where the entry Pij is the expected number of times of a random 
walk starting from node i visits node j [61]. 
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We take the local value Pa of the fundamental matrix P, computed in node i, as node 
i's global importance. Intuitively, the value signifies the weight of infinite number of random 
walks that start and end in node i. Unlike the PageRank ranking method, we explicitly bias the 
computation towards node i, since we force the random walks to start from it. This captures the 
local importance node i has when we compute a personalized rating of the network locally by 
node i. Figure 6.1 captures this bias using a simple network of 10 nodes. 

The fundamental matrix can be calculated by summing the expectations of random walks of 
length one, two, three etc., R+R^+R'^+. . . . Assuming that the spectral radius q{R) < 1, we get 
YlvLi R^ = {I — R)^^ ■ Since R is stochastic, the inverse (/ — R)^^ does not exist. We therefore 
slightly change the problem: we select a parameter a < 1, to initialize the matrix J = I — aR 
where / is the identity matrix. We know that g(aR) < 1 and thus aR + a^R^ + a^R^ + . . . 
converges to (/ — aR)^^. 



0.05 0.08 0.11 0.12 0.13 0.13 0.12 0.11 0.08 0.05 




0.06 0.11 0.11 0.11 0.11 0.11 0.11 0.11 0.11 0.06 



Figure 6.1: Example output of the Spatial ranking (on top) vs. PageRank (bottom) over a 
network of 10 nodes. In the Spatial ranking method node rank is biased towards the center, 
where in the PageRank method, non-leaf nodes have equal rank. This can be explained by the 
fact that the sum of self-returning random walks increases towards the center. 



Theorem 23. The Spatial Ranking method is an instance of the cost function 6.1, when yi = 
0, Wii = 1, and Wij are entries in an absorbing Markov chain R. 

Proof We have shown that the fundamental matrix is equal to (J — R)^^. Assume that the 
edge weights are probabilities of Markov-chain transitions (which means the matrix is stochastic), 
substitute /5 = a,Wii = l,?/j = 1 in the cost function ( 6.1): 

min_E'(x) = 1 * (x^ — 1)^ — a Wij{xi — XjY- 

The same cost function in linear algebra form: 

min E{x) = x"^/x — ax^Rx. — 2x. 
Now we calculate the derivative and compare to zero and get 

x= {I -aRy\ 

□ 

We have shown that the Spatial Ranking algorithm fits well within our unifying family of cost 
functions. Setting y to a fixed constant, means there is no individual prior input at the nodes, 
thus we measure mainly the topological influence in ranking the nodes. In case that we use 
Wii 7^ 1 we will get a biased ranking, where nodes with higher weight have higher influence at 
result of the computation. 
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6.2.3 Personalized PageRank 

The PageRank algorithm is a fundamental algorithm in computing node ranks in the network [57]. 
The personalized PageRank algorithm is a variant described in [58]. In a nutshell, the Markov- 
chain transition probability matrix M is constructed out of the web links graph. A prior probability 
X is taken to weight the result. The personalized PageRank calculation can be computed [58]: 

PR{x) = (1 - a){I - aM)-^x, 

where a is a weighting constant which determines the speed of convergence in trade-off with the 
accuracy of the solution and / is the identity matrix. 

Theorem 24. The Personalized PageRank algorithm can be expressed using our cost function 6. 1, 
up to a translation. 

proof sketch. The proof is similar to the Spatial Ranking proof. There are two differences: the 
first is that the prior distribution x is set in y to weight the output towards the prior. Second, 
in the the Personalized PageRank algorithm the result is multiplied by the constant (1 — a), 
which we omit in our cost function. This computation can be done locally at each node after the 
algorithm terminates, since a is a known fixed system parameter. □ 

6.2.4 Information Centrality 

In the information centrality (IC) node ranking method [59], the non-negative weighted graph 
G = {V, E) is considered as an electrical network, where edge weights is taken to be the electrical 
conductance. A vector of supply h is given as input, and the question is to compute the electrical 
potentials vector p. This is done by computing the graph Laplacian and solving the set of linear 
equations Lp = b. The IC method (a.k.a current flow betweenness centrality) is defined by: 



^i^jPij{i) -PijU)' 

The motivation behind this definition is that a centrality of node is measured by inverse proportion 
to the effective resistance between a node to all other nodes. In case the effective resistance is 
lower, there is a higher electrical current flow in the network, thus making the node more "socially 
influential" . 

One can easily show that the IC method can be derived from our cost function, by calculating 
(L -|- J)~^ where L is the graph Laplacian and J is a matrix of all ones. Note that the inverted 
matrix is not sparse, unlike all the previous constructions. Hence, a special construction which 
transforms this matrix into a sparse matrix is needed. This topic is out of scope of this work. 

6.3 Experimental Results 

We have shown that various ranking methods can be computed by solving a linear system of 
equations. We propose to use the GaBP algorithm, for efficiently computing the ranking methods 
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Figure 6.2: Blog network topology 



Figure 6.3: DIMES Internet topology 





Figure 6.4: US gov documents topology Figure 6.5: MSN Messenger topology 

Figures 2-5 illustrate subgraphs taken from the different topologies plotted using the Pajek 
software [62]. Despite the different graph characteristics, the GaBP performed very well on all 
tested topologies 



distributively by the network nodes. Next, we bring simulation results which show that for very 
large networks the GaBP algorithm performs remarkably well. 

For evaluating the feasibility and efficiency of our rating system, we used several types of large 
scale social networks topologies: 

1. MSN Live Messenger. We used anonymized data obtained from Windows Live Messenger 
that represents Messenger users' buddy relationships. The Messenger network is rather large 
for simulation (over two hundred million users), and so we cut sub-graphs by starting at 
a random node, and taking a BFS cut of about one million nodes. The diameter of the 
sampled graph is five on average. 

2. Blog crawl data. We used blog crawl data of a network of about 1.5M bloggers, and 8M 
directed links to other blogs. This data was provided thanks to Elad Yom-Tov from IBM 
Research Labs, Haifa, Israel. 

3. DIMES Internet measurements. We used a snapshot of an Internet topology from 
January 2007 captured by the DIMES project [63]. The 300K nodes are routers and the 
2.2M edges are communication lines. 
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Topology 


Nodes 


Edges 


Data Source 


MSN Messenger graph 


IM 


9.4M 


Microsoft 


Blogs Web Crawl 


1.5M 


8M 


IBM 


DIMES 


300K 


2.2M 


DIMES Internet measurements 


US Government documents 


12M 


64 M 


Web research collection 



Table 6.1: Topologies used for experimentation 



4. US gov document repository. We used a crawl of 12M pdf documents of US government, 
where the links are links within the pdf documents pointing to other documents within the 
repository [64]. 

One of the interesting questions, is the practical convergence speed of our rating methods. 
The results are given using the MSN Messenger social network's topology, since all of the other 
topologies tested obtained similar convergence results. We have tested the Peer-to-Peer rating 
algorithm. We have drawn the input ratings yi and edge weights Wij in uniformly at random in the 
range [0, 1]. We have repeated this experiment with different initializations for the input rating 
and the edge weights and got similar convergence results. We have tested other cost functions, 
including PageRank and Spatial Ranking and got the same convergence results. 

Figure 6.6 shows the convergence speed of the Peer-to-Peer rating algorithm. The x-axis 
represents round numbers. The rounds are given only for reference, in practice there is no need 
for the nodes to be synchronized in rounds as shown in [65]. The y-axis represents the sum-total 
of change in ratings relative to the previous round. We can see that the node ratings converge 
very fast towards the optimal rating derived from the cost function. After only five iterations, the 
total change in nodes ratings is about 1 (which means an average change of 1 x 10^^ per node). 

6.3.1 Rating Benchmark 

For demonstrating the applicability of our proposed cost functions, we have chosen to implement 
a "benchmark" that evaluates the effectiveness of the various cost functions. Demonstrating this 
requires a quantitative measure beyond mere speed and scalability. The benchmark approach 
we take is as follows. First, we produce a ladder of "social influence" that is inferred from the 
network topology, and rank nodes by this ladder, using the Spatial ranking cost function. Next, 
we test our Peer-to-Peer rating method in the following settings. Some nodes are initialized 
with rate values, while other nodes are initialized with empty ratings. Influential nodes are given 
different initial ratings than non-influential nodes. The expected result is that the ratings of 
influential nodes should affect the ratings of the rest of the network so long as they are not vastly 
outnumbered by opposite ratings. 

As a remark, we note that we can use the social ranks additionally as trust indicators, giving 
higher trust values to edges which are incident to high ranking nodes, and vice versa. This has 
the nice effect of initially placing low trust on intruders, which by assumption, cannot appear 
influential. 
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Figure 6.6: Convergence of rating over a social network of IM nodes and 9.4M edges. Note, 
that using asynchronous rounds, the algorithm converges faster, as discussed in [65] 



For performing our benchmark tests, we once again used simulation over the 1 million nodes 
sub-graph of the Messenger network. Using the ranks produced by our spatial ranking, we selected 
the seven highest ranking nodes and assigned them an initial rating value 5. We also selected 
seven of the lowest ranking nodes and initialized them with rating value 1. All other nodes started 
with null input. The results of the rating system in this settings are given in Figure 3. After 
about ten rounds, a majority of the nodes converged to a rating very close to the one proposed 
by the influential nodes. We ran a variety of similar tests and obtained similar results in all cases 
where the influential nodes were not totally outnumbered by opposite initial ratings; for brevity, 
we report only one such test here. 

The conclusion we draw from this test is that a combination of applying our GaBP solver for 
computing first the rating of nodes, and then using this rating for choosing influential nodes and 
spreading their beliefs in the network has the desired effect of fast and influential dissemination 
of the socially influential nodes. This effect can have a lot of applications in practice, including 
targeted commercials to certain zones in the social network. 

Quite importantly, our experiment shows that our framework provide good protection against 
malicious infiltrators: Assuming that intruders have low connectivity to the rest of the network, 
we demonstrate that it is hard for them to influence the rating values in the system. Furthermore, 
we note that this property will be reinforced if the trust values on edges are reduced due to their 
low ranks, and using users satisfaction feedback. 
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Rating 



Figure 6.7: Final rating values in a network of IM nodes. Initially, 7 highest ranking nodes rate 
5 and 7 lowest ranking nodes rate 1. 
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Consider a discrete-time channel with a real input vector x = {xi, . . . ,xk}'^ and a corresponding 
real output vector y = {yi, . . . , = /{x^} ^ M^.^ Here, the function /{■} denotes the 
channel transformation. Specifically, CDMA multiuser detection problem is characterized by the 
following linear channel 

y = Sx + n , 

where S^^k is the CDMA chip sequence matrix, x G { — 1, 1}^ is the transmitted signal, y G 
is the observation vector and n is a vector of AWGN noise. The multiuser detection problem is 
stated as follows. Given S, y and knowledge about the noise, we would like to infer the most 
probable x. This problem is NP-hard. 
The matched filter output is 

S^y = Rx + S^n , 

where S^n is a K x 1 additive noise vector and Rkxk = S^S is a positive-definite symmetric 
matrix, often known as the correlation matrix. 

Typically, the binary constraint on x is relaxed to x G [—1, 1]. This relaxation is called linear 
detection. In linear detection the decision rule is 

ic = A{x*} = A{R-iS^y} . (7.1) 

The vector x* is the solution (over M) to Rx = S^y. Estimation is completed by adjusting the 
(inverse) matrix-vector product to the input alphabet, accomplished by using a proper clipping 
function A{-} {e.g. , for binary signaling A{-} is the sign function). 

Assuming linear channels with AWGN with variance cr^ as the ambient noise, the linear min- 
imum mean-square error (MMSE) detector can be described by using H + a'^Ix, known to be 
optimal when the input distribution Px is Gaussian. In general, linear detection is suboptimal 
because of its deterministic underlying mechanism {i.e. , solving a given set of linear equations), 
in contrast to other estimation schemes, such as MAP or maximum likelihood, that emerge from 
an optimization criteria. 

■"^An extension to the complex domain is straightforward. 
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The essence of detection theory is to estimate a hidden input to a channel from empirically- 
observed outputs. An important class of practical sub-optimal detectors is based on linear detec- 
tion. This class includes, for instance, the conventional single-user matched filter, the decorrelator 
(also, called the zero-forcing equalizer), the linear minimum mean-square error (LMMSE) detec- 
tor, and many other detectors with widespread applicability [66,67]. In general terms, given a 
probabilistic estimation problem, linear detection solves a deterministic system of linear equations 
derived from the original problem, thereby providing a sub-optimal, but often useful, estimate of 
the unknown input. 

Applying the GaBP solver to linear detection, we establish a new and explicit link between 
BP and linear detection. This link strengthens the connection between message-passing inference 
and estimation theory, previously seen in the context of optimal maximum a-posteriori (MAP) 
detection [68, 69] and several sub-optimal nonlinear detection techniques [70] applied in the 
context of both dense and sparse [71,72] graphical models. 

In the following experimental study, we examine the implementation of a decorrelator detector 
in a noiseless synchronous CDMA system with binary signaling and spreading codes based upon 
Gold sequences of length m = 7?- Two system setups are simulated, corresponding to n = 3 and 
n = 4 users, resulting in the cross-correlation matrices 

and 

/ 7 -1 3 3 \ 

1 -1 7 3-1 

^ 7 3 3 7 -1 

\ 3 -1 -1 7 / 

respectively.^ 

The decorrelator detector, a member of the family of linear detectors, solves a system of 
linear equations, Rx = S^y, thus the vector of decorrelator decisions is determined by taking 
the signum of the vector R^^S^y. Note that R3 and R4 are not strictly diagonally dominant, 
but their spectral radius are less than unity, since p(|l3 — R3I) = 0.9008 < 1 and p(|l4 — R4I) = 
0.8747 < 1, respectively. In all of the experiments, we assumed the output vector was the all-ones 
vector. 

Table 7.1 compares the proposed GaBP algorithm with standard iterative solution meth- 
ods [2] (using random initial guesses), previously employed for CDMA multiuser detectors (MUD). 
Specifically, MUD algorithms based on the algorithms of Jacobi, Gauss-Seidel (GS) and (opti- 
mally weighted) successive over-relaxation (SOR)'^ were investigated [11,12]. The table lists the 



(7.3) 



^In this case, as long as the system is not overloaded, i.e. the number of active users n is not greater than the 
spreading code's length m, the decorrelator detector yields optimal detection decisions. 

^These particular correlation settings were taken from the simulation setup of Yener et al. [13]. 

^This moving average improvement of Jacobi and GS algorithms is equivalent to what is known in the BP 
literature as 'damping' [73]. 
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Algorithm 


Iterations t (R3) 


Iterations t (R4) 


Jacobi 


111 


24 


GS 


26 


26 


Parallel GaBP 


23 


24 


Optimal SOR 


17 


14 


Serial GaBP 


16 


13 



Table 7.1: Decorrelator for K = 3,4-user, N = 7 Gold code CDMA. Total number of iterations 
required for convergence (threshold e = 10"^) for GaBP-based solvers vs. standard methods. 




Iteration t Iteration t 

Figure 7.1: Convergence of the two gold CDMA matrices. To the left R3, to the right, R4. 



convergence rates for the two Gold code-based CDMA settings. Convergence is identified and 
declared when the differences in all the iterated values are less than 10~^. We see that, in compar- 
ison with the previously proposed detectors based upon the Jacobi and GS algorithms, the GaBP 
detectors converge more rapidly for both n = 3 and n = 4. The serial (asynchronous) GaBP 
algorithm achieves the best overall convergence rate, surpassing even the SOR-based detector. 

Further speed-up of GaBP can be achieved by adapting known acceleration techniques from 
linear algebra, such Aitken's method and Steffensen's iterations, as explained in Section 3.2. 
Table 7.2 demonstrates the speed-up of GaBP obtained by using these acceleration methods. 
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Algorithm 


R3 


R.4 


Jacobi+Steffensen^ 


59 




Parallel GaBP+Steffensen 


13 


13 


Serial GaBP+Steffensen 


9 


7 



Table 7.2: Decorrelator for K = 3,4-user, N = 7 Gold code CDMA. Total number of itera- 
tions required for convergence (threshold e = 10"^) for Jacobi, parallel and serial GaBP solvers 
accelerated by Steffensen iterations. 

in comparison with that achieved by the similarly modified Jacobi algorithm.^ We remark that, 
although the convergence rate is improved with these enhanced algorithms, the region of conver- 
gence of the accelerated GaBP solver remains unchanged. 

For the algorithms we examined. Figure 7.1 displays the Euclidean distance between the 
tentative (intermediate) results and the fixed-point solution as a function of the number of 
iterations. As expected, all linear algorithms exhibit a logarithmic convergence behavior. Note 
that GaBP converges faster on average, although there are some fluctuations in the GaBP curves, 
in contrast to the monotonicity of the other curves. 

An interesting question concerns the origin of this convergence speed-up associated with 
GaBP. Better understanding may be gained by visualizing the iterations of the different methods 
for the matrix R3 case. The convergence contours are plotted in the space of {xi,X2,X3} 
in Fig. 7.3. As expected, the Jacobi algorithm converges in zigzags towards the fixed point 
(this behavior is well-explained in Bertsekas and Tsitsiklis [46]). The fastest algorithm is serial 
GaBP. It is interesting to note that GaBP convergence is in a spiral shape, hinting that despite 
the overall convergence improvement, performance improvement is not guaranteed in successive 
iteration rounds. In this case the system was simulated with a specific R matrix for which Jacobi 
algorithm and other standard methods did not even converge. Using Aitken's method, a further 
speed-up in GaBP convergence was obtained. 

Despite the fact that the examples considered correspond to small multi-user systems, we 
believe that the results reflect the typical behavior of the algorithms, and that similar qualitative 
results would be observed in larger systems. In support of this belief, we note, in passing, that 
GaBP was experimentally shown to converge in a logarithmic number of iterations in the cases of 
very large matrices both dense (with up to hundreds of thousands of dimensions [75]) and sparse 
(with up to millions of dimensions [20]). 

As a final remark on the linear detection example, we mention that, in the case of a channel 
with Gaussian input signals, for which linear detection is optimal, the proposed GaBP scheme 
reduces to the BP-based MUD scheme, recently introduced by Montanari et al. [50], as shown in 

^Application of Aitken and Steffensen's methods for speeding-up the convergence of standard (non-BP) iter- 
ative solution algorithms in the context of MUD was introduced by Leibig et al. [74]. 
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Figure 7.2: Convergence acceleration of the GaBP algorithm using Aitken and Steffensen meth- 
ods. The left graph depicts a 3 x 3 gold CDMA matrix, the right graph 4x4 gold CDMA 
matrix. 




Figure 7.3: Convergence of the GaBP algorithm vs. Jacobi on a 3 x 3 gold CDMA matrix. Each 
dimension shows one coordinate of the solution. Jacobi converges in zigzags while GaBP has 
spiral convergence. 



Chapter 11.1. Montanari's BP scheme, assuming a Gaussian prior, has been proven to converge 
to the MMSE (and optimal) solution for any arbitrarily loaded, randomly-spread CDMA system 
{i.e. , a system where p(|I„— R|) ^ l).^Thus Gaussian-input additive white Gaussian noise CDMA 

^For non-Gaussian signaling, e.g'. with binary input alphabet, this BP-based detector is conjectured to converge 
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is another example for which the proposed GaBP solver converges to the MAP decisions for any 
m X n random spreading matrix S, regardless of the spectral radius. 



7.1 Extending GaBP to support non-square matrices 

in the previous section, linear detection has been explicitly linked to BP [7], using a Gaussian belief 
propagation (GaBP) algorithm. This allows for an efficient iterative computation of the linear 
detector [14], circumventing the need of, potentially cumbersome, direct matrix inversion (via, 
e.g., Gaussian elimination). The derived iterative framework was compared quantitatively with 
'classical' iterative methods for solving systems of linear equations, such as those investigated 
in the context of linear implementation of CDMA demodulation [11,12,13]. GaBP is shown to 
yield faster convergence than these standard methods. Another important work is the BP-based 
MUD, recently derived and analyzed by Montanari et al. [50] for Gaussian input symbols. 

There are several drawbacks to the linear detection technique presented earlier [7]. First, the 
input matrix R„xn = ^nxk^kxn (the chip correlation matrix) needs to be computed prior to 
running the algorithm. This computation requires ri^k operations. In case where the matrix S 
is sparse [72], the matrix R might not no longer be sparse. Second, GaBP uses memory to 
store the messages. For a large n this could be prohibitive. 

In this section, we propose a new construction that addresses those two drawbacks. In our 
improved construction, given a non-rectangular CDMA matrix S„xa,-, we compute the MMSE 
detector x = (S^S + ^)^^S^y, where ^ = cr~^I is the AWGN diagonal inverse covariance 
matrix. We utilize the GaBP algorithm which is an efficient iterative distributed algorithm. The 
new construction uses only 2nk memory for storing the messages. When k <^ n this represents 
significant saving relative to the 2?t,^ in our previously proposed algorithm. Furthermore, we do 
not explicitly compute S^S, saving an extra n'^k overhead. 

In Chapter 11.1 we show that Montanari's algorithm [50] is an instance of GaBP. By showing 
this, we are able to prove new convergence results for Montanari's algorithm. Montanari proves 
that his method converges on normalized random-spreading CDMA sequences, assuming Gaussian 
signaling. Using binary signaling, he conjectures convergence to the large system limit. Here, 
we extend Montanari's result, to show that his algorithm converges also for non-random CDMA 
sequences when binary signaling is used, under weaker conditions. Another advantage of our work 
is that we allow different noise levels per bit transmitted. 

7.1.1 Distributed Iterative Computation of the MMSE Detector 

In this section, we efficiently extend the applicability of the proposed GaBP-based solver for 
systems with symmetric matrices [7] to systems with any square {i.e., also nonsymmetric) or 
rectangular matrix. We first construct a new symmetric data matrix R based on an arbitrary 

only in the large-system limit, as n,m ^ oo [50]. 
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(non-rectangular) matrix S G R'^^" 

^ " ( S S ) ^ ^{k+n)xik+n)_ (7 4) 

Additionally, we define a new vector of variables x = {x'^,z'^}^ G M('=+")^\ where x G R''^^ 
is the (to be shown) solution vector and z G M"^^ is an auxiliary hidden vector, and a new 
observation vector y = {0^, y^}^ G M('=+")^i. 

Now, we would like to show that solving the symmetric linear system Rx = y and taking 
the first k entries of the corresponding solution vector x is equivalent to solving the original (not 
necessarily symmetric) system Rx = y. Note that in the new construction the matrix R is sparse 
again, and has only 2nk off-diagonal nonzero elements. When running the GaBP algorithm we 
have only 2nk messages, instead of n"^ in the previous construction. 

Writing explicitly the symmetric linear system's equations, we get 

X + S^z = 0, Sx - ^z = y. 

Thus, 

x = vl>-iS^(y-Sx), 

and extracting x we have 

x=(S^S + *)-iSV 

Note, that when the noise level is zero, \E' = Omxm, we get the Moore-Pen rose pseudoinverse 
solution 

x=(S^S)-^S^y = Sty. 



7.1.2 Relation to Factor Graph 

In this section we give an alternate proof of the correctness of our construction. Given the inverse 
covariance matrix R defined in (7.4), and the shift vector x we can derive the matching self and 
edge potentials 

(t>i{xi) = exp{-l/2xiRlxi - XiHi) , 
which is a factorization of the Gaussian system distribution 

i i,j i<k i>k i,j 



/ ' _^ S ^ ' '■I] 

i<k i>k i,j 
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CDMA channel linear transformation 




Figure 7.4: Factor graph describing the linear channel 



Next, we show the relation of our construction to a factor graph. We will use a factor graph 
with k nodes to the left (the bits transmitted) and n nodes to the right (the signal received), 
shown in Fig 7.4. Using the definition x = {x-^,z^}^ G ]R('^'+")^^ the vector x represents the k 
input bits and the vector z represents the signal received. Now we can write the system probability 
as: 

p(x) oc [ 7Vr(x; 0, 1)J\f{z; ^)rfx . 
Jst 

It is known that the marginal distribution over z is: 

= Ar(z;0,S^S + ^). 
The marginal distribution is Gaussian, with the following parameters: 

£;(z|x) = (S^S + ^)''S^y, 

Cov{z\k) = (S^S + ^)~^ 

It is interesting to note that a similar construction was used by Frey [76] in his seminal 1999 
work when discussing the factor analysis learning problem. Comparison to Frey's work is found 
in Chapter 11.2. 



7.1.3 Convergence Analysis 

In this section we characterize the convergence properties of our linear detection method base 
on GaBP. We know that if the matrix R is strictly diagonally dominant, then GaBP converges 
and the marginal means converge to the true means [8, Claim 4]. Noting that the matrix R is 
symmetric, we can determine the applicability of this condition by examining its columns. As 
shown in Figure 7.5 we see that in the first k columns, we have the k CDMA sequences. We 
assume random-spreading binary CDMA sequences where the total system power is normalized to 
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Figure 7.5: An example randomly spreading CDMA sequences matrices using our new construc- 
tion. Using this illustration, it is easy to give a sufficient convergence proof to the GaBP algorithm. 
Namely, when the above matrix is diagonally dominant. 



one. In other words, the absolute sum of each column is 1. By adding e to the main diagonal, we 
insure that the first k columns are diagonally dominant. In the next n columns of the matrix R, 
we have the diagonal covariance matrix ^ with different noise levels per bit in the main diagonal, 
and zero elsewhere. The absolute sum of each column of S is k/n, thus when the noise level 
of each bit satisfies '^i > k/n, we have a convergence guarantee. Note, that the convergence 
condition is a sufficient condition. Based on Montanari's work, we also know that in the large 
system limit, the algorithm converges for binary signaling, even in the absence of noise. 

In Chapter 11.1 we prove that Montanari's algorithm is an instance of our algorithm, thus 
our convergence results apply to Montanari's algorithm as well. 



7.2 Applying GaBP Convergence Fix 

Next, we apply our novel double loop technique described in Chapter 5, for forcing the convergence 
of our linear detection algorithm using GaBP. We use the following setting: given a random- 
spreading CDMA code^ with chip sequence length n = 256, and A; = 64 users. We assume a 
diagonal AWGN with = 1. Matlab code of our implementation is available on [77]. 

Using the above settings, we have drawn at random random-spreading CDMA matrix. Typi- 
cally, the sufficient convergence conditions for the GaBP algorithm do not hold. For example, we 
have drawn at random a randomly-spread CDMA matrix with p{\Ik — C^|) = 4.24, where 
is a diagonally-normalized version of (C -|- a'^Ix)- Since p{\Ik — C^\) > 1, the GaBP algorithm 
for multiuser detection is not guaranteed to converge. 

Figure 7.6 shows that under the above settings, the GaBP algorithm indeed diverged. The x- 
axis represent iteration number, while the values of different Xi are plotted using different colors. 
This figure depicts well the fluctuating divergence behavior. 

® Randomly-spread CDMA code is a code where the matrix S is initialized uniformly with the entries ±i. 
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Figure 7.6: Divergence of the GaBP algorithm for the multiuser detection problem, when n = 

256, A; = 64. 

Next, we deployed our proposed construction and used a diagonal loading to force convergence. 
Figure 7.7 shows two different possible diagonal loadings. The x-axis shows the Newton step 
number, while the y-axis shows the residual. We experimented with two options of diagonal 
loading. In the first, we forced the matrix to be diagonally-dominant (DD). In this case, the 
spectral radius p = 0.188. In the second case, the matrix was not DD, but the spectral radius 
was p = 0.388. Clearly, the Newton method converges faster when the spectral radius is larger. 
In both cases the inner iterations converged in five steps to an accuracy of 10~^. 

Convergence of fixed GaBP iteration witti n=256,l<=64 
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Figure 7.7: Convergence of the fixed GaBP iteration under the same settings {n = 256, k = 64). 

The tradeoff between the amount of diagonal weighting to the total convergence speed is 
shown in Figures 3,4. A CDMA multiuser detection problem is shown {k = 128, n = 256). 
Convergence threshold for the inner and outer loops where 10"^ and lO^'^. The x-axis present 
the amount of diagonal weighting normalized such that 1 is a diagonally-dominant matrix, y- 
axis represent the number of iterations. As expected, the outer loop number of iterations until 
convergence grows with 7. In contrast, the average number of inner loop iterations per Newton 
step (Figure 4) tends to decrease as 7 increases. The total number of iterations (inner x outer) 
represents the tradeoff between the inner and outer iterations and has a clear global minima. 
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Diagonal weigting vs. num of iterations Diagonal weigting vs. inner loop iterations 




Diagonal weighting y Diagonal weighting y 



Figure 7.8: Effect of diagonal weighting on outer Figure 7.9: Effect of diagonal weighting on inner 
loop convergence speed. loop convergence speed. 
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Chapter 8 

Support Vector Regression 



In this chapter, we introduce a distributed support vector regression solver based on the Gaussian 
Belief Propagation (GaBP) algorithm. We improve on the original GaBP algorithm by reducing 
the communication load, as represented by the number of messages sent in each optimization 
iteration, from to n aggregated messages, where n is the number of data points. Previously, 
it was known that the GaBP algorithm is very efficient for sparse matrices. Using our novel 
construction, we demonstrate that the algorithm exhibits very good performance for dense ma- 
trices as well. We also show that the GaBP algorithm can be used with kernels, thus making the 
algorithm more powerful than previously possible. 

Using extensive simulation we demonstrate the applicability of our protocol vs. the state-of- 
the-art existing parallel SVM solvers. Using a Linux cluster of up to a hundred machines and 
the IBM Blue Gene supercomputer we managed to solve very large data sets up to hundreds of 
thousands data point, using up to 1,024 CPUs working in parallel. Our comparison shows that 
the proposed algorithm is just as accurate as these solvers, while being significantly faster. 

We start by giving on overview of the related SVM problem. 

8.1 Classification Using Support Vector Machines 

We begin by formulating the SVM problem. Consider a training set: 

Z} = {(x„y,), z = l,...,iV, X, gM™, ^^€{-1,1}}. (8.1) 

The goal of the SVM is to learn a mapping from Xj to yi such that the error in mapping, as 
measured on a new dataset, would be minimal. SVMs learn to find the linear weight vector that 
separates the two classes so that 

I/, (xi ■ w + 6) > 1 for z = 1, . . . , iV. (8.2) 

There may exist many hyperplanes that achieve such separation, but SVMs find a weight 
vector w and a bias term h that maximize the margin 2/||w||. Therefore, the optimization 
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problem that needs to be solved is 



min Jd(w) = ^ || w|| , (8.3) 



subject to (xi ■ w + 6) > 1 for i = l,...,N. (8.4) 

Points lying on the hyperplane ?/j (xi ■ w + 6) = 1 are called support vectors. 
If the data cannot be separated using a linear separator, a slack variable > is introduced 
and the constraint is relaxed to: 

(xi ■ w + 6) > 1 - for ^ = 1, . . . , iV. (8.5) 

The optimization problem then becomes: 

N 

minJ^(w) = i||w||^ + C^e*, (8.6) 

i=l 

subject to t/j (xi ■ w + 6) > 1 for i = 1, . . . , N, (8.7) 
6>0 for z = l,...,iV. (8.8) 

The weights of the linear function can be found directly or by converting the problem into its 
dual optimization problem, which is usually easier to solve. 

Using the notation of Vijayakumar and Wu [78], the dual problem is thus: 

max Lr){h) = ^h,- Dh , (8.9) 

i 

subject to 0<hi<C, i=l,...,A^, (8.10) 

T.,h,y^ = Q, (8.11) 

where D is a matrix such that Dij = ViVjK (xj, Xj) and K (-, ■) is either an inner product of the 
samples or a function of these samples. In the latter case, this function is known as the kernel 
function, which can be any function that complies with the Mercer conditions [79]. For example, 
these may be polynomial functions, radial-basis (Gaussian) functions, or hyperbolic tangents. 
If the data is not separable, C is a tradeoff between maximizing the margin and reducing the 
number of misclassifications. 

The classification of a new data point is then computed using the following equation: 

/ (x) = sign I ^ KyiK (x^, x)+h\ (8-12) 
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8.2 Kernel Ridge Regression problem 

Kernel Ridge Regression (KRR) implements a regularized form of the least squares method useful 
for both regression and classification. The non-linear version of KRR is similar to Support- Vector 
Machine (SVM) problem. However, in the latter, special emphasis is given to points close to the 
decision boundary, which is not provided by the cost function used by KRR. 
Given training data 

P = {x„y,}U, X, G i?'^ G i?, 

the KRR algorithm determines the parameter vector w G i?'^ of a non-linear model (using the 
"kernel trick"), via minimization of the following objective function: [75]: 

I 

min A| |w| |^ + ^iVi - w^$(xi))^ , 

i=l 

where A is a tradeoff parameter between the two terms of the optimization function, and $() is 
a (possible non-linear) mapping of the training patterns. 

One can show that the dual form of this optimization problem is given by: 

max W{a) = y^a + ^Aa^Ka - \a^a , (8.13) 

where K is a matrix whose (i, j)-th entry is the kernel function Kj = $(xj)^$(xj). 
The optimal solution to this optimization problem is: 

a = 2A(K + AI)~V 
The corresponding prediction function is given by: 

/(x) = w^$(x) = y^(K + AI)-iK(x„ x). 

8.3 Previous Approaches for Solving Parallel SVMs 

There are several main methods for finding a solution to an SVM problem on a single-node 
computer. (See [79, Chapter 10]) for a taxonomy of such methods.) However, since solving an 
SVM is quadratic in time and cubic in memory, these methods encounter difficulty when scaling 
to datasets that have many examples and support vectors. The latter two are not synonymous. 
A large dataset with many repeated examples might be solved using sub-sampling approaches, 
while a highly non-separable dataset with many support vectors will require an altogether different 
solution strategy. The literature covers several attempts at solving SVMs in parallel, which allow 
for greater computational power and larger memory size. In Collobert et al. [80] the SVM solver is 
parallelized by training multiple SVMs, each on a subset of the training data, and aggregating the 
resulting classifiers into a single classifier. The training data is then redistributed to the classifiers 
according to their performance and the process is iterated until convergence is reached. The 
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need to re-divide the data among the SVM classifiers implies that the data must be exchanged 
between nodes several times; this rules out the use of an approach where bandwidth is a concern. 
A more low-level approach is taken by Zanghirati et al. [81], where the quadratic optimization 
problem is divided into smaller quadratic programs, each of which is solved on a different node. 
The results are aggregated and the process is repeated until convergence. The performance of 
this method has a strong dependence on the caching architecture of the cluster. Graf et al. [82] 
partition the data and solve an SVM for each partition. The support vectors from each pair of 
classifiers are then aggregated into a new training set for which an SVM is solved. The process 
continues until a single classifier remains. The aggregation process can be iterated, using the 
support vectors of the final classifier in the previous iteration to seed the new classifiers. One 
problem with this approach is that the data must be repeatedly shared between nodes, meaning 
that once again the goal of data distribution cannot be attained. The second problem, which 
might be more severe, is that the number of possible support vectors is restricted by the capacity 
of a single SVM solver. Yom Tov [83] proposed modifying the sequential algorithm developed 
in [78] to batch mode. In this way, the complete kernel matrix is held in distributed memory and 
the Lagrange multipliers are computed iteratively. This method has the advantage that it can 
efficiently solve difficult SVM problems that have many support vectors to their solution. Based 
on that work, we show how an SVM solution can be obtained by adapting a Gaussian Belief 
Propagation algorithm to the solution of the algorithm proposed in [78]. 

Recently, Hazan et al. proposed an iterative algorithm for parallel decomposition based on 
Fenchel Duality [84]. Zanni et a/, propose a decomposition method for computing SVM in paral- 
lel [85]. We compare our running time results to both systems in Section 8.5. 

For our proposed solution, we take the exponent of dual SVM formulation given in equation 
(8.9) and solve maxexp(Lo(/i)). Since exp{L£){h)) is convex, the solution of maxexp(Lo(/i)) 
is a global maximum that also satisfies max Lu{h) since the matrix D is symmetric and positive 
definite. Now we can relate to the new problem formulation as a probability density function, 
which is in itself Gaussian: 

p{h) oc exp{-^h^Dh + h^l), 

where 1 is a vector of (1, 1, ■ ■ ■ , 1), and find the assignment of /i = argmaxp(/i). To solve the 
inference problem, namely computing the marginals h, we propose using the GaBP algorithm, 
which is a distributed message passing algorithm. We take the computed h as the Lagrange 
multiplier weights of the support vectors of the original SVM data points and apply a threshold 
for choosing data points with non-zero weight as support vectors. 

Note that using this formulation we ignore the remaining constraints (8.10), (8.11). In other 
words we do not solve the SVM problem, but the unconstrained kernel ridge regression problem 
(8.13). Nevertheless, empirical results presented in Chapter 8.5 show that we achieve a very good 
classification vs. state-of-the-art SVM solvers. 

Finally, following [78], we remove the explicit bias term b and instead add another dimension to 
the pattern vector Xj such that Xi = (xi, X2, ■ ■ ■ , x^, A), where A is a scalar constant. The mod- 
ified weight vector, which incorporates the bias term, is written as w = {wi,W2, ■ ■ ■ ,WN,b/ X). 
However, this modification causes a change to the optimized margin. Vijayakumar and Wu [78] 
discuss the effect of this modification and reach the conclusion that "setting the augmenting term 
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to zero (equivalent to neglecting the bias term) in high dimensional kernels gives satisfactory re- 
sults on real world data". We did not completely neglect the bias term and in our experiments, 
which used the Radial Basis Kernel, we set it to as proposed in [83]. 

8.4 Our novel construction 

We propose to use the GaBP algorithm for solving the SVR problem (8.13). In order to force the 
algorithm to converge, we artificially weight the main diagonal of the kernel matrix D to make 
it diagonally dominant. Section 8.5 outlines our empirical results showing that this modification 
did not significantly affect the error in classifications on all tested data sets. 

A partial justification for weighting the main diagonal is found in [75]. In the 2-Norm soft 
margin formulation of the SVM problem, the sum of squared slack variables is minimized: 

mill II w||2 + CSj^j^ 
such that yi{w ■ Xj + 6) > 1 — 

The dual problem is derived: 

W{h) = Eihi - ^T.ijyiyjhihj{xi ■ Xj + -^Sij), 

where 5ij is the Kronecker 5 defined to be 1 when i = j, and zero elsewhere. It is shown that the 
only change relative to the 1-Norm soft margin SVM is the addition of 1/C to the diagonal of the 
inner product matrix associated with the training set. This has the effect of adding 1/C to the 
eigenvalues, rendering the kernel matrix (and thus the GaBP problem) better conditioned [75]. 

One of the desired properties of a large scale algorithm is that it should converge in asyn- 
chronous settings as well as in synchronous settings. This is because in a large-scale communica- 
tion network, clocks are not synchronized accurately and some nodes may be slower than others, 
while some nodes experience longer communication delays. 

Corollary 25. Assuming one of the convergence conditions (Theorems 14, 15) holds, the GaBP 
algorithm convergence using serial (asynchronous) scheduling as well. 

Proof. The quadratic Min-Sum algorithm [6] provides a convergence proof in the asynchronous 
case. In Chapter 11.4 we show equivalence of both algorithms. Thus, assuming one of the 
convergence conditions holds, the GaBP algorithm converges using serial scheduling as well. □ 

8.5 Experimental Results 

We implemented our proposed algorithm using approximately 1,000 lines of code in C. We imple- 
mented communication between the nodes using the MPICH2 message passing interface.-^ Each 
node was responsible for d data points out of the total n data points in the dataset. 

■"■littp : //www-unix .mcs . anl . gov/mpi/mpich/ 
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Dataset 


Dimension 


Train 


Test 


Error (%) 


GaBP 


Sequential 


SVMIight 


Isolet 


617 


6238 


1559 


7.06 


5.84 


49.97 


Letter 


16 


20000 




2.06 


2.06 


2.3 


Mushroom 


117 


8124 




0.04 


0.05 


0.02 


Nursery 


25 


12960 




4.16 


5.29 


0.02 


Pageblocks 


10 


5473 




3.86 


4.08 


2.74 


Pen digits 


16 


7494 


3498 


1.66 


1.37 


1.57 


Spambase 


57 


4601 




16.3 


16.5 


6.57 



Table 8.1: Error rates of the GaBP solver versus those of the parallel sequential solver and 
SVMIight. 



Dataset 


Run times (sec) 




GaBP 


Sequential 


Isolet 


228 


1328 


Letter 


468 


601 


Mushroom 


226 


176 


Nursery 


221 


297 


Pageblocks 


26 


37 


Pen digits 


45 


155 


Spambase 


49 


79 



Table 8.2: Running times (in seconds) of the GaBP solver (working in a distributed environment) 
compared to that of the IBM parallel solver. 



Our implementation used synchronous communication rounds because of MP! limitations. In 
Section 8.6 we further elaborate on this issue. 

Each node was assigned several examples from the input file. Then, the kernel matrix D 
was computed by the nodes in a distributed fashion, so that each node computed the rows of 
the kernel matrix related to its assigned data points. After computing the relevant parts of the 
matrix D, the nodes weighted the diagonal of the matrix D, as discussed in Section 8.4. Then, 
several rounds of communication between the nodes were executed. In each round, using our 
optimization, a total of n sums were calculated using MPLAIIreduce system call. Finally, each 
node output the solution x, which was the mean of the input Gaussian that matched its own data 
points. Each Xj signified the weight of the data point i for being chosen as a support vector. 

To compare our algorithm performance, we used two algorithms: Sequential SVM (SVMSeq) 
[78] and SVMIight [86]. We used the SVMSeq implementation provided within the IBM Parallel 
Machine Learning (PML) toolbox [87]. The PML implements the same algorithm by Vijaykumar 
and Wu [78] that our GaBP solver is based on, but the implementation in through a master-slave 
architecture as described in [83]. SVMIight is a single computing node solver. 
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Figure 8.1: Speedup of tlie GaBP algoritlim vs. 2 CPUS. 



Table 8.1 describes tlie seven datasets we used to compare the algorithms and the classification 
accuracy obtained. These computations were done using five processing nodes (3.5GHz Intel 
Pentium machines, running the Linux operating system) for each of the parallel solvers. All 
datasets were taken from the UCI repository [88]. We used medium-sized datasets so that run- 
times using SVMIight would not be prohibitively high. All algorithms were run with an RBF 
kernel. The parameters of the algorithm (kernel width and misclassification cost) were optimized 
using a line-search algorithm, as detailed in [89]. 

Note that SVMIight is a single node solver, which we use mainly as a comparison for the 
accuracy in classification. 

Using the Friedman test [90], we did not detect any statistically significant difference between 
the performance of the algorithms with regards to accuracy {p < 0.10^'^). 

Figure 8.1 shows the speedup results of the algorithm when running the GaBP algorithm on 
a Blue Gene supercomputer. The speedup with N nodes is computed as the run time of the 
algorithm on a single node, divided by the run time using N nodes. Obviously, it is desirable 
to obtain linear speedup, i.e., doubling computational power halves the processing time, but this 
is limited by the communication load and by parts of the algorithm that cannot be parallelized. 
Since Blue Gene is currently limited to 0.5 GB of memory at each node, most datasets could not 
be executed on a single node. We therefore show speedup compared to two nodes. As the figure 
shows, in most cases we get a linear speedup up to 256 CPUs, which means that the running time 
is linearly proportional to one over the number of used CPUs. When using 512 - 1024 CPUs, the 
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Dataset 


Dim 


Num of examples 


Run time GaBP (sec) 


Run time [85] (sec) 


Run time [84] 


Covert ype 


54 


150,000/300,000 


468 


24365 


16742 


MNIST 


784 


60,000 


756 


359 


18 



Table 8.3: Running times of the GaBP solver for large data sets using 1024 CPUs on an IBM 
Blue Gene supercomputer. Running time results are compared to two state-of-the-art solvers: 
[85] and [84]. 



communication overhead reduces the efficiency of the parallel computation. We identified this 
problem as an area for future research into optimizing the performance for larger scale grids. 

We also tested the ability to build classifiers for larger datasets. Table 8.3 shows the run times 
of the GaBP algorithm using 1024 CPUs on two larger datasets, both from the UCI repository. 
This demonstrates the ability of the algorithm to process very large datasets in a reasonably 
short amount of time. We compare our running time to state-of-the-art parallel decomposition 
method by Zanni et al. [85] and Hazan et al. . Using the MNIST dataset we where considerably 
slower by a factor of two, but in the larger Covertype dataset we have a superior performance. 
Note that the reported running times should be taken with a grain of salt, since the machines 
used for experimentation are different. Zanni used 16 Pentium IV machines with 16Gb memory, 
Hazan used 10 Pentium IV machines with 4Gb memory, while we used a larger number of weaker 
Pentium III machines with 400Mb of memory. Furthermore, in the Covertype dataset we used 
only 150,000 data points while Zanni and Hazan used the full dataset which is twice larger. 

8.6 Discussion 

In this chapter we demonstrated the application of the Gaussian Belief Propagation to the solution 
ofSVM problems. Our experiments demonstrate the usefulness of this solver, being both accurate 
and scalable. 

We implemented our algorithm using a synchronous communication model mainly because 
MPICH2 does not support asynchronous communication. While synchronous communication is 
the mode of choice for supercomputers such as Blue Gene, in many cases such as heterogeneous 
grid environments, asynchronous communication will be preferred. We believe that the next 
challenging goal will be to implement the proposed algorithm in asynchronous settings, where 
algorithm rounds will no longer be synchronized. 

Our initial experiments with very large sparse kernel matrices (millions of data points) show 
that asynchronous settings converge faster. Recent work by Elidan et al. [65] supports this claim 
by showing that in many cases the BP algorithm converges faster in asynchronous settings. 

Another challenging task would involve scaling to data sets of millions of data points. Cur- 
rently the full kernel matrix is computed by the nodes. While this is effective for problems with 
many support vectors [83], it is not required in many problems that are either easily separable or 
where the classification error is less important compared to the time required to learn the mode. 
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Thus, solvers scaling to much larger datasets may have to diverge from the current strategy of 
computing the full kernel matrix and instead sparsify the kernel matrix as is commonly done in 
single node solvers. 

Finally, it remains an open question whether SVMs can be solved efficiently in Peer-to-Peer 
environments, where each node can (efficiently) obtain data from only several close peers. Future 
work will be required in order to verify how the GaBP algorithm performs in such an environment, 
where only partial segments of the kernel matrix can be computed by each node. 
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Distributed Computation of Kalman 
Filter 

In Chapter 7 we show how to compute efficiently and distributively the MMSE prediction for the 
multiuser detection problem, using the Gaussian Belief Propagation (GaBP) algorithm. The basic 
idea is to shift the problem from the linear algebra domain into a probabilistic graphical model, 
solving an equivalent inference problem using the efficient belief propagation inference engine. 

In this chapter, we propose to extend the previous construction, and show, that by performing 
the MMSE computation twice on the matching inputs we are able to compute several algorithms: 
Kalman filter, Gaussian information bottleneck and the Affine-scaling interior point method. We 
reduce the discrete Kalman filter computation [91] to a matrix inversion problem and show how 
to solve it using the GaBP algorithm. We show that Kalman filter iteration that is composed from 
prediction and measurement steps can be computed by two consecutive MMSE predictions. We 
explore the relation to Gaussian information bottleneck (GIB) [92] and show that Kalman filter 
is a special instance of the GIB algorithm, when the weight parameter /3 = 1. To the best of our 
knowledge, this is the first algorithmic link between the information bottleneck framework and 
linear dynamical systems. We discuss the connection to the Affine-scaling interior-point method 
and show it is an instance of the Kalman filter. 

Besides of the theoretical interest of linking compression, estimation and optimization to- 
gether, our work is highly practical, since it proposes a general framework for computing all of 
the above tasks distributively in a computer network. This result can have many applications in 
the fields of estimation, collaborative signal processing, distributed resource allocation, etc. 

A closely related work is [39]. In this work, Frey et al. focus on the belief propagation algorithm 
(a.k.a sum-product algorithm) using factor graph topologies. They show that the Kalman filter 
algorithm can be computed using belief propagation over a factor graph. In this contribution 
we extend their work in several directions. First, we extend the computation to vector variables 
(relative to scalar variables in Frey's work). Second, we use a different graphical model: an 
undirected graphical model, which results in simpler update rules, where Frey uses factor-graph 
with two types of messages: factor to variables and variables to factors. Third and most important, 
we allow an efficient distributed calculation of the Kalman filter steps, where Frey's algorithm is 
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centralized. 

Another related work is [93]. In this work the link between Kalman filter and linear program- 
ming is established. In this thesis we propose a new and different construction that ties the two 
algorithms. 



9.1 Kalman Filter 

The Kalman filter is an efficient iterative algorithm to estimate the state of a discrete-time 
controlled process x E that is governed by the linear stochastic difference equation-^: 

Xk = Axk-i + Buk-i + Wk-i , (9.1) 

with a measurement z G R"^ that is Zk = Hxk + Vk- The random variables Wk and Vk that 
represent the process and measurement AWGN noise (respectively). p{w) ~ A/'(0, Q), p(f ) ~ 
A/'(0, R). We further assume that the matrices A, H, B, Q, R are given. ^ 
The discrete Kalman filter update equations are given by [91]: 
The prediction step: 

x^ = Axk^i + Buk^i, (9-2a) 
P," = APk-iA^ + Q. (9.2b) 

The measurement step: 

Kk = P^H^{HP-H^ + R)-\ (9.3a) 
Xk = xl + Kk{zk- Hx'^), (9.3b) 
Pk = {I-KkH)P~. (9.3c) 

where I is the identity matrix. 

The algorithm operates in rounds. In round k the estimates Kk,Xk,Pk are computed, incor- 
porating the (noisy) measurement obtained in this round. The output of the algorithm are the 
mean vector Xk and the covariance matrix P^. 



9.2 Our Construction 

Our novel contribution is a new efficient distributed algorithm for computing the Kalman filter. 
We begin by showing that the Kalman filter can be computed by inverting the following covariance 
matrix: 





A 





A^ 


Q 


H 







R 



E = \ Q H \ , (9.4) 

V R I 

■"■We assume there is no external input, namely Xk ~ Axk-i +Wk-i- However, our approach can be easily 
extended to support external inputs. 

^Another possible extension is that the matrices A, H, B,Q, R change in time, in this thesis we assume they 
are fixed. However, our approach can be generalized to this case as well. 
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and taking the lower right 1x1 block to be P^. 

The computation of E^^ can be done efficiently using recent advances in the field of Gaussian 
belief propagation [14,19]. The intuition for our approach, is that the Kalman filter is composed 
of two steps. In the prediction step, given Xk, we compute the MMSE prediction of [39]. 
In the measurement step, we compute the MMSE prediction of x^+i given x^, the output of 
the prediction step. Each MMSE computation can be done using the GaBP algorithm [19]. 
The basic idea is that given the joint Gaussian distribution p(x, y) with the covariance matrix 

C = ( ^""'^ ^^'^ I , we can compute the MMSE prediction 



^yx yy 



where 



y = aTgmaxp{y\x) oc J^ifiy\x, ^y\l) 



l^y\x i'^yy ^yx^xx^xy) "^yx^xx-^ -I 
'^y\x = ^yy ~ ^yx^xx^xy) 

This in turn is equivalent to computing the Schur complement of the lower right block of the 
matrix C . In total, computing the MMSE prediction in Gaussian graphical model boils down to a 
computation of a matrix inverse. In [14] we have shown that GaBP is an efficient iterative algo- 
rithm for solving a system of linear equations (or equivalently computing a matrix inverse). In [19] 
we have shown that for the specific case of linear detection we can compute the MMSE estimator 
using the GaBP algorithm. Next, we show that performing two consecutive computations of the 
MMSE are equivalent to one iteration of the Kalman filter. 

Theorem 26. The lower right 1x1 block of the matrix inverse (eq. 9.4), computed by two 
MMSE iterations, is equivalent to the computation of done by one iteration of the Kalman 
filter algorithm. 



Proof. We prove that inverting the matrix E (eq. 9.4) is equivalent to one iteration of the 
Kalman filter for computing P^. 

We start from the matrix E and show that can be computed in recursion using the Schur 
complement formula: 

D - CA~^B (9.5) 

applied to the 2x2 upper left submatrix of E, where D = Q,C = A'^, B = A, A = Pk-i, we 
get: 

D - C B 



Now we compute recursively the Schur complement of lower right 2x2 submatrix of the 
matrix E using the matrix inversion lemma: 



A~^ + A~^B{D - CA-^Py^CA^^ 
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where A-^ = , B = ,C = H, D = Q. In total we get: 



Pk +Pj; H^iR + H P- H^r' H P- = (9.6) 
(9.3a) (9.3c) 



(/ - P^H^{HP-H^ + RY^ H)P^ = {I- KuH)P, = P, 

□ 

In Section 9.2 we explain how to utilize the above observation to an efficient distributed 
iterative algorithm for computing the Kalman filter. 

9.3 Gaussian Information Bottleneck 

Given the joint distribution of a source variable X and another relevance variable Y, Information 
bottleneck (IB) operates to compress X, while preserving information about Y [94,95], using the 
following variational problem: 

min £ : £ = /(X; T) - /5/(T; Y) 

p{t\x) 

T represents the compressed representation of X via the conditional distributions p{t\x), while 
the information that T maintains on Y is captured by the distribution p{y\t). /5 > is a 
lagrange multiplier that weights the tradeoff between minimizing the compression information 
and maximizing the relevant information. As /3 ^ we are interested solely in compression, 
but all relevant information about Y is lost I{Y]T) = 0. When /3 ^ oo we are focused on 
preservation of relevant information, in this case T is simply the distribution X and we obtain 
/(T; Y) = I{X; Y). The interesting cases are in between, when for finite values of /5 we are able 
to extract rather compressed representation of X while still maintaining a significant fraction of 
the original information about Y. 

An iterative algorithm for solving the IB problem is given in [95]: 



P\t)= l^p{x)P^{t\x)dx, (9.7a) 

P'(y\i) = pk)L P'{t\x)pix, y)dx , (9.7b) 

where Z''^^ is a normalization factor computed in round k + 1. 

The Gaussian information bottleneck (GIB) [92] deals with the special case where the under- 
lying distributions are Gaussian. In this case, the computed distribution p{t) is Gaussian as well, 
represented by a linear transformation Tk = AkX + where Ak is a joint covariance matrix 
of X and T, ~ A/'(0, S^^) is a multivariate Gaussian independent of X. The outputs of the 
algorithm are the covariance matrices representing the linear transformation T: A^jE^^.. 
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Xltl Xk Xk 

(c) 




(d) 



Figure 9.1: Comparison of the different graphical models used, (a) Gaussian Information Bottle- 
neck [92] (b) Kalman Filter (c) Frey's sum-product factor graph [39] (d) Our new construction. 



An iterative algorithm is derived by substituting Gaussian distributions into (9.7), resulting in 
the following update rules: 

S^+i = (9.8a) 
= /?S5,+iS-|^Afc(/-S,|,.S;i). (9.8b) 



Table 9.1: Summary of notations in the GIB [92] paper vs. Kalman filter [91] 



GIB [92] 


Kalman [91] 


Kalman meaning 






a-priori estimate error covariance 




Q 


process AWGN noise 




R 


measurement AWGN noise 




A 


process state transformation matrix 


^yx 










measurement transformation matrix 




H 






Pk 


posterior error covariance in round k 


^x\yk 


Pk 


a-priori error covariance in round k 



Since the underlying graphical model of both algorithms (GIB and Kalman filter) is Markovian 
with Gaussian probabilities, it is interesting to ask what is the relation between them. In this work 
we show, that the Kalman filter posterior error covariance computation is a special case of the 
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GIB algorithm when /? = 1. Furthermore, we show how to compute GIB using the Kalman filter 
when P > 1 (the case where < /3 < 1 is not interesting since it gives a degenerate solution 
where A^ = [92].) Table 9.1 outlines the different notations used by both algorithms. 

Theorem 27. The GIB algorithm when P = 1 is equivalent to the Kalman filter algorithm. 

Proof. Looking at [92, §39], when /3 = 1 we get 

MMSE [92, §38b] 
-1 ' " — ~i ^ ' ^ ' 

[92, §34] [92, §34] [92, §33] [92, §33] 



^ s . ^ 



^tfe + St/S<fc2/ ^y\tk ^ytk^t^^ = A^^xA + + {A^T^xA + S^) A'^T.^y- 
[92, §33] 

■^ylt.^yxA + Sg)^ = A^ExA + + (A^S.A + J:^)A^Exy 

MMSE 

• {Ey + Eyt^E;^^EtJ EyxAiA^ExA + S^)^ = E , A + + (A^ExA + E^)A'^E,y- 
[92, §5] ^ ([92, §5] [92, §5] 

(S,y + A ^yx {AYixA + S^) YixyA )YiyxA(^A'^YixA-\-Yi^')'^. 
Now we show this formulation is equivalent to the Kalman filter with the following notations: 

Substituting we get: 

Pk Pk R H Pk hT H ^ " 

{A'T.xA + S^) + {A'^xA + Sg) A'T.xy -(t^ + A^T.yx {A^^xA + S^) (A^S^A -\ 

Which is equivalent to (9.6). Now we can apply Theorem 1 and get the desired result. □ 

Theorem 28. The GIB algorithm when /? > 1 can be computed by a modified Kalman filter 
iteration. 

Proof. In the case where > 1, the MAP covariance matrix as computed by the GIB algorithm 
is: 

=/?S,,|, + (l-/3)S,, (9.9) 

This is a weighted average of two covariance matrices. is computed at the first phase of 
the algorithm (equivalent to the prediction phase in Kalman literature), and '^tk\y computed 
in the second phase of the algorithm (measurement phase). At the end of the Kalman iteration, 
we simply compute the weighted average of the two matrices to get (9.9). Finally, we compute 
Afc+i using (eq. 9.8b) by substituting the modified □ 
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Figure 9.2: Comparison of the schematic operation of the different algorithms, (a) iterative 
information bottleneck operation (b) Kalman filter operation (c) Affine-scaling operation. 



There are some differences between the GIB algorithm and Kalman filter computation. First, 
the Kalman filter has input observations Zk in each round. Note that the observations do not 
affect the posterior error covariance computation (eq. 9.3c), but affect the posterior mean 
Xk (eq. 9.3b). Second, Kalman filter computes both posterior mean and error covariance P^. 
The covariance T.^^ computed by the GIB algorithm was shown to be identical to Pk when (3=1. 
The GIB algorithm does not compute the posterior mean, but computes an additional covariance 
Ak (eq. 9.8b), which is assumed known in the Kalman filter. 

From the information theoretic perspective, our work extends the ideas presented in [96]. 
Predictive information is defined to be the mutual information between the past and the future of 
a time serias. In that sense, by using Theorem 2, Kalman filter can be thought of as a prediction 
of the future, which from the one hand compresses the information about past, and from the 
other hand maintains information about the present. 

The origins of similarity between the GIB algorithm and Kalman filter are rooted in the IB 
iterative algorithm: For computing (9.7a), we need to compute (9. 7a, 9. 7b) in recursion, and 
vice versa. 



9.4 Relation to the AfFine-Scaling Algorithm 

One of the most efficient interior point methods used for linear programming is the Affine- 
scaling algorithm [97]. It is known that the Kalman filter is linked to the Affine-scaling algorithm 
[93]. In this work we give an alternate proof, based on different construction, which shows that 
Affine-scaling is an instance of Kalman filter, which is an instance of GIB. This link between 
estimation and optimization allows for numerous applications. Furthermore, by providing a single 
distribute efficient implementation of the GIB algorithm, we are able to solve numerous problems 
in communication networks. 
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The linear programming problem in its canonical form is given by: 

minimize c^x, (9.10a) 

subject to Ax = b, x > , (9.10b) 

where A G M"^^ with rankjA} = p < n. We assume the problem is solvable with an optimal 
X*. We also assume that the problem is strictly feasible, in other words there exists x G M" that 
satisfies Ax = b and x > 0. 

The Affine-scaling algorithm [97] is summarized below. Assume xq is an interior feasible point 
to (9.10b). Let D = diag{xQ). The Affine-scaling is an iterative algorithm which computes a 
new feasible point that minimizes the cost function (10.1a): 

xi =xo--D'r, (9.11) 

7 

where < a < 1 is the step size, r is the step direction. 

r = (c-A^w), (9.12a) 
w = {AD^A^y'AD^c, (9.12b) 
7 = max{eiPDc). (9.12c) 

i 

where is the z*^ unit vector and P is a projection matrix given by: 

P = I - DA^iAD^A^y^AD. (9.13) 

The algorithm continues in rounds and is guaranteed to find an optimal solution in at most n 
rounds. In a nutshell, in each iteration, the Affine-scaling algorithm first performs an Affine- 
scaling with respect to the current solution point Xj and obtains the direction of descent by 
projecting the gradient of the transformed cost function on the null space of the constraints set. 
The new solution is obtained by translating the current solution along the direction found and 
then mapping the result back into the original space [93]. This has interesting analogy for the 
two phases of the Kalman filter. 

Theorem 29. The Affine-scaling algorithm iteration is an instance of the Kalman filter algorithm 
iteration. 

Proof We start by expanding the Affine-scaling update rule: 




Xl = Xn D r = Xn D t == Xn ; , ^" , r, , r D T 

^ ° ^ ° maxe,;PL)c o - DA^{AD^A^)AD) Dc 

(9.13) 
{9.12b, 




ap^ (c - A^w) _ _ «d2(c-a^ (AD^A'^y^AD'^c) 

-^0 ma.xie,{I-DAT{AD^AT)-^AD)Dc ~ maxiei{I~DAT{AD^AT)AD)~^Dc 
_ _ aD{I~DAT {AD'^ AT)-^ AD)Dc 
~ -^0 ra!a^^ei{I-DAT{AD'^AT)-^AD)Dc ■ 
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Looking at the numerator and using the Schur complement formula (9.5) with the follow- 
ing notations: A = {AD^A'^)-\B = AD,C = DA^.D = / we get the following matrix: 

AD'^A^ AD \ AAA 
jjj^ J ]■ Again, the upper left block is a Schur complement A = 0, B = AD,C = 

DA^ , D = I of the following matrix: ^ jjj^r ^ ■ ^oX.3\ we get a 3 x 3 block matrix of 

/ AD Q 
the form: DA^ I AD 

\ DA^ I 
Note that the divisor is a scalar that affects the scaling of the step size. 

Using Theorem 1, we get a computation of Kalman filter with the following parameters: 
A, H = AD,Q = I,R = I,Po = 0. This has an interesting interpretation in the context of 
Kalman filter: both prediction and measurement transformation are identical and equal AD. The 
noise variance of both transformations are Gaussian variables with prior ~ J\f{0,I). □ 

We have shown how to express the Kalman filter, Gaussian information bottleneck and Affine- 
scaling algorithms as a two step MMSE computation. Each step involves inverting a 2 x 2 block 
matrix. The MMSE computation can be done efficiently and distributively using the Gaussian 
belief propagation algorithm. 
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In recent years, considerable attention has been dedicated to the relation between belief prop- 
agation message passing and linear programming schemes. This relation is natural since the 
maximum a-posteriori (MAP) inference problem can be translated into integer linear program- 
ming (ILP) [98]. 

Weiss et al. [98] approximate the solution to the ILP problem by relaxing it to a LP problem 
using convex variational methods. In [99], tree-reweighted belief propagation (BP) is used to find 
the global minimum of a convex approximation to the free energy. Both of these works apply 
discrete forms of BP. Globerson et al. [100,101] assume convexity of the problem and modify the 
BP update rules using dual-coordinate ascent algorithm. Hazan et al. [84] describe an algorithm 
for solving a general convex free energy minimization. In both cases the algorithm is guaranteed 
to converge to the global minimum as the problem is tailored to be convex. 

This chapter takes a different path. Unlike most of the previous work, which uses gradient- 
descent methods, we show how to use interior-point methods, which are shown to have strong 
advantages over gradient and steepest descent methods. (For a comparative study see [102, 
§9. 5, p. 496].) The main benefit of using interior point methods is their rapid convergence, 
which is quadratic once we are close enough to the optimal solution. Their main drawback 
is that they require heavier computational effort for forming and inverting the Hessian matrix 
needed for computing the Newton step. To overcome this, we propose the use of Gaussian BP 
(GaBP) [14,7], which is a variant of BP applicable when the underlying distribution is Gaussian. 
Using GaBP, we are able to reduce the time associated with the Hessian inversion task, from 
0(n^-^) to 0(?iplog(e)/log(7)) at the worst case, where p < n \s the size of the constraint matrix 
A, e is the desired accuracy, and 1/2 < 7 < 1 is a parameter characterizing the matrix A. This 
computational savings is accomplished by exploiting the sparsity of the Hessian matrix. 

An additional benefit of our GaBP-based approach is that the polynomial-complexity LP solver 
can be implemented in a distributed manner, enabling efficient solution of large-scale problems. 
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10.1 Standard Linear Programming 

Consider the standard linear program 

minimizex c^x, (10.1a) 

subject to Ax = b, x>0, (10.1b) 

where A G W^^^ with rank{A} = p < n. We assume the problem is solvable with an optimal 
X* assignment. We also assume that the problem is strictly feasible, or in other words there exists 
X e M" that satisfies Ax = b and x > 0. 

Using the log-barrier method [102, §11.2], one gets 

minimizex,^ c^x — /iS^^j^ logXfc , (10.2a) 
subject to Ax = b. (10.2b) 

This is an approximation to the original problem (10.1a). The quality of the approximation 
improves as the parameter n ^ 0. 

Table 10.1: The Newton algorithm [102, §9.5.2] . 

Given feasible starting point xq and tolerance e > 0, A; = 1 

Repeat 1 Compute the Newton step and decrement 
Ax=r(x)-V'(x), A2 = f(x)^Ax 

2 Stopping criterion, quit if A^/2 < e 

3 Line search. Choose step size t by backtracking line search. 

4 Update, x^ := Xfe_i + tAx, k = k + 1 



Now we would like to use the Newton method for solving the log-barrier constrained objective 
function (10.2a), described in Table 10.1. Suppose that we have an initial feasible point xq for 
the canonical linear program (10.1a). We approximate the objective function (10.2a) around the 
current point x using a second-order Taylor expansion 

/(i + Ax) ~ /(i) + /'(i) Ax + l/2Ax^/"(5c)Ax. (10.3) 

Finding the optimal search direction Ax yields the computation of the gradient and compare it 
to zero 

^ = /'(i) + r(i)Ax = 0, (10.4) 

A^ = -f"{±r'f'{±). (10.5) 

Denoting the current point x = (x, /i, y) and the Newton step Ax = (x, y,yu), we compute 
the gradient 

/'(x, /i, y) = (<9/(x, /i, y)/9x, 9/(x, /i, y)/<9/i, 9/(x, /i, y)/dy) 
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The Lagrangian is 

>C(x,yU,y) = c^x-/iSfcloga;fc + y^(b- Ax), (10.7) 

5^^|P^ = c-^X-l-y-A = 0, (10.8) 

= ^X-, (10.9) 

ax 

where X = diag(x) and 1 is the all-one column vector. Substituting (10.8)-(10.9) into (10.4), 
we get 

c - /iX^H - y^A + ;uX"^x = , (10.10) 

c-/iX^^l + x^X^2 = y^A, (10.11) 

^%i^ = Ax = 0. (10.12) 

dy 

Now multiplying (10.11) by AX^, and using (10.12) to eliminate x we get 

AX^A^y = AX^c - /iAXl . (10.13) 

These normal equations can be recognized as generated from the linear least-squares problem 

min||XA^y-Xc-/iAXl||2. (10.14) 
y 

Solving for y we can compute the Newton direction x, taking a step towards the boundary 
and compose one iteration of the Newton algorithm. Next, we will explain how to shift the 
deterministic LP problem to the probabilistic domain and solve it distributively using GaBP. 

10.2 From LP to Probabilistic Inference 

We start from the least-squares problem (10.14), changing notations to 

min||Fy-g||2, (10.15) 
y 

where F = XA^, g = Xc + ^uAXl. Now we define a multivariate Gaussian 

p(x) ^ p(x, y) cc exp(-l/2(Fy - g)^I(Fy - g)) . (10.16) 

It is clear that y, the minimizing solution of (10.15), is the MAP estimator of the conditional 
probability 

y = argmaxp(y|x) = Af{{F^F)~^F^g, (F^F)"^) . 
y 
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As shown in Chapter 7, the pseudo-inverse solution can be computed efficiently and distribu- 
tively by using the GaBP algorithm. 

The formulation (10.16) allows us to shift the least-squares problem from an algebraic to 
a probabilistic domain. Instead of solving a deterministic vector-matrix linear equation, we now 
solve an inference problem in a graphical model describing a certain Gaussian distribution function. 
We define the joint covariance matrix 

C=(f^^). (10.17) 

and the shift vector b = {0^,g^}^ G 

Given the covariance matrix C and the shift vector b, one can write explicitly the Gaussian den- 
sity function, p(x) , and its corresponding graph Q with edge potentials ('compatibility functions') 
^ij and self-potentials ('evidence') 0j. These graph potentials are determined according to the 
following pairwise factorization of the Gaussian distribution j»(x) oc HILi 4'i{'^i) Y\.{ij} ^iji^i^ ^j); 
resulting in 'ipij(xi,Xj) = exp(— XjCjjXj), and (pi^Xi) = exp (biXi — Ciixf/2j. The set of edges 
{i,j} corresponds to the set of non-zero entries in C (10.17). Hence, we would like to calculate 
the marginal densities, which must also be Gaussian, 

\/i > p, 

where /i^ and Pi are the marginal mean and inverse variance (a.k.a. precision), respectively. 
Recall that in the GaBP algorithm, the inferred mean /ij is identical to the desired solution y of 
(10.17). 



10.3 Extending the Construction to the Primal-Dual Method 

in the previous section we have shown how to compute one iteration of the Newton method 
using GaBP. in this section we extend the technique for computing the primal-dual method. This 
construction is attractive, since the extended technique has the same computation overhead. 
The dual problem ( [103]) conforming to (10.1a) can be computed using the Lagrangian 

£(x, y, z) = c^x + y^(b - Ax) - z^x, z > 0, 



g{y,z) = inf /:(x,y,z), 

X 

subject to Ax = b, x > 0. 



while 



a£(x,y,z) 
c^x 



A^y - z = 0. 



(10.18a) 
(10.18b) 

(10.19) 
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Substituting (10.19) into (10.18a) we get 

maximizey b^y 

subject to A^y + z = c, z > 0. 

Primal optimality is obtained using (10.8) [103] 

y^A = c- (10.21) 

Substituting (10.21) in (10.20a) we get the connection between the primal and dual 

^X~4 = z. 

In total, we have a primal-dual system (again we assume that the solution is strictly feasible, 
namely x > 0, z > 0) 

Ax = b, X > 0, 
A^y + z = c, z > 0, 
Xz = /il. 



The solution [x(yu), y(yu), z(/i)] of these equations constitutes the central path of solutions to 
the logarithmic barrier method [102, 11.2.2]. Applying the Newton method to this system of 
equations we get 

b-Ax 

c - A^y - z I . (10.23) 
/il — Xz 

The solution can be computed explicitly by 

Ay = (AZ-iXA^)-i(AZ-iX(c - /iX'^l - A^y) + b - Ax), 
Ax= XZ-i(A^Ay + /iX-il = c + A^y), 
Az = -A^Ay + c - A^y - z. 

The main computational overhead in this method is the computation of (AZ~^XA^)~\ which 
is derived from the Newton step in (10.5). 

Now we would like to use GaBP for computing the solution. We make the following simple 
change to (10.23) to make it symmetric: since z > 0, we can multiply the third row by Z~^ and 
get a modified symmetric system 

b- Ax 

c — A-^y — z 
/iZ-^l - X 

b- Ax 

Defining A = ( A I , and b = j c — A^y — z | . one can use the GaBP 

^X y \ /iZ^^l-X 

algorithm. 






A^ 
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0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 



Figure 10.1: A simple example of using GaBP for solving linear programming with two variables 
and eleven constraints. Each red circle shows one iteration of the Newton method. 



In general, by looking at (10.4) we see that the solution of each Newton step involves inverting 
the Hessian matrix /"(x). The state-of-the-art approach in practical implementations of the 
Newton step is first computing the Hessian inverse /"(x)"^ by using a (sparse) decomposition 
method like (sparse) Cholesky decomposition, and then multiplying the result by /'(x). In our 
approach, the GaBP algorithm computes directly the result Ax, without computing the full matrix 
inverse. Furthermore, if the GaBP algorithm converges, the computation of Ax is guaranteed to 
be accurate. 



10.3.1 Applications to Interior-Point Methods 

We would like to compare the running time of our proposed method to the Newton interior-point 
method, utilizing our new convergence results of the previous section. As a reference we take the 
Karmarkar algorithm [104] which is known to be an instance of the Newton method [105]. Its 
running time is composed of n rounds, where on each round one Newton step is computed. The 
cost of computing one Newton step on a dense Hessian matrix is 0{n'^^^), so the total running 
time is 0(n^-^). 

Using our approach, the total number of Newton iterations, n, remains the same as in the 
Karmarkar algorithm. However, we exploit the special structure of the Hessian matrix, which is 
both symmetric and sparse. Assuming that the size of the constraint matrix A is n x p, p < 
n, each iteration of GaBP for computing a single Newton step takes 0{np), and based on 
the convergence analysis in Section 3.1, for a desired accuracy e||b||oo we need to iterate for 
r = [log(e)/log(7)] rounds, where 7 is defined in (3.1). The total computational burden for a 
single Newton step is 0(nplog(e)/log(7)). There are at most n rounds, hence in total we get 
0(nVog(e)/log(7)). 
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10.4 Case Study: Network Utility Maximization 

We consider a network that supports a set of flows, each of which has a nonnegative flow rate, 
and an associated utility function. Each flow passes over a route, which is a subset of the edges 
of the network. Each edge has a given capacity, which is the maximum total traffic (the sum 
of the flow rates through it) it can support. The network utility maximization (NUM) problem 
is to choose the flow rates to maximize the total utility, while respecting the edge capacity 
constraints [106,107]. We consider the case where all utility functions are concave, in which case 
the NUM problem is a convex optimization problem. 

A standard technique for solving NUM problems is based on dual decomposition [108,109]. 
This approach yields fully decentralized algorithms, that can scale to very large networks. Dual 
decomposition was first applied to the NUM problem in [110], and has led to an extensive body 
of research on distributed algorithms for network optimization [111,112,113] and new ways to 
interpret existing network protocols [114]. 

Recent work by Zymnis et al. [115], presents a specialized primal-dual interior-point method 
for the NUM problem. Each Newton step is computed using the preconditioned conjugate gradient 
method (PCG). This proposed method had a significant performance improvement over the dual 
decomposition approach, especially when the network is congested. Furthermore, the method 
can handle utility functions that are not strictly concave. The main drawback of the primal-dual 
method is that it is centralized, while the dual decomposition methods are easily distributed. 

Next, we compare the performance of the GaBP solver proposed in this chapter, to the 
truncated Newton method and dual decomposition approaches. We provide the first comparison 
of performance of the GaBP algorithm vs. the PCG method. The PCG method is a state- 
of-the-art method used extensively in large-scale optimization applications. Examples include 
^i-regularized logistic regression [116], gate sizing [117], and slack allocation [118]. Empirically, 
the GaBP algorithm is immune to numerical problems with typically occur in the PCG method, 
while demonstrating a faster convergence. The only previous work comparing the performance of 
GaBP vs. PCG we are aware of is [119], which used a small example of 25 nodes, and the work 
of [8] which used a grid of 25 x 25 nodes. 

We believe that our approach is general and not limited to the NUM problem. It could 
potentially be used for the solution of other large scale distributed optimization problems. 

10.4.1 NUM Problem Formulation 

There are n flows in a network, each of which is associated with a fixed route, i.e. , some subset 
of m links. Each flow has a nonnegative rate, which we denote With the flow j 

we associate a utility function Uj : R ^ R, which is concave and twice differentiable, with 
domf/j C R^. The utility derived by a flow rate fj is given by Uj{fj). The total utility 
associated with all the flows is then U{f) = f/i(/i) + ■ ■ ■ + Un{fn)- 

The total traffic on a link in the network is the sum of the rates of all flows that utilize that 
link. We can express the link traffic compactly using the routing or link-route matrix R E R"^^", 
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defined as 

1 flow j's route passes over link i 
otherwise. 



Rij 



Each link in the network has a (positive) capacity Ci, . . . , Cm- The traffic on a link cannot exceed 
its capacity, i.e. , we have Rf < c, where < is used for componentwise inequality. 

The NUM problem is to choose the rates to maximize total utility, subject to the link capacity 
and the nonnegativity constraints: 

maximize U{f) (^a?A^ 
subject to Rf<c, / > 0, ^lu.zt; 

with variable / G R". This is a convex optimization problem and can be solved by a variety of 
methods. We say that / is primal feasible if it satisfies Rf < c, / > 0. 
The dual of problem (10.24) is 

minimize A^c + E;=i(-f/.)*(-rjA) 

subject to A > 0, \ ■ ) 

where A G is the dual variable associated with the capacity constraint of problem (10.24), rj 
is the jth column of i? and {—Uj)* is the conjugate of the negative jth utility function [120, §3.3], 

(— f/j)*(a) = sup(ax + Uj{x)). 

x>0 

We say that A is dual feasible if it satisfies A > and A G n"^i dom{-Uj)* . 



10.4.2 Previous Work 

In this section we give a brief overview of the dual-decomposition method and the primal-dual 
interior point method proposed in [115]. 

Dual decomposition [108, 109, 110, 111] is a projected (sub)gradient algorithm for solving 
problem (10.25), in the case when all utility functions are strictly concave. We start with any 
positive A, and repeatedly carry out the update 

fj := aigmax {Uj{x) - x{rJX)) , j = l,...,n, 

x>0 

A := (A - a (c 

where a > is the step size, and x+ denotes the entrywise nonnegative part of the vector x. It 
can be shown that for small enough a, f and A will converge to /* and A*, respectively, provided 
all Uj are differentiable and strictly concave. The term s = c — Rf appearing in the update 
is the slack in the link capacity constraints (and can have negative entries during the algorithm 
execution). It can be shown that the slack is exactly the gradient of the dual objective function. 

Dual decomposition is a distributed algorithm. Each flow is updated based on information 
obtained from the links it passes over, and each link dual variable is updated based only on the 
flows that pass over it. 
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The primal-dual interior-point method is based on using a Newton step, applied to a suitably 
modified form of the optimality conditions. The modification is parameterized by a parameter t, 
which is adjusted during the algorithm based on progress, as measured by the actual duality gap 
(if it is available) or a surrogate duality gap (when the actual duality gap is not available). 

We first describe the search direction. We modify the complementary slackness conditions to 
obtain the modified optimality conditions 

-VU{f) + R^X- 12 = 

diag(A)s = {l/t)l 
diag(/i)/ = 

where t > is a parameter that sets the accuracy of the approximation. (As t ^ oo, we recover 
the optimality conditions for the NUM problem.) Here we implicitly assume that /, s, A,/i > 0. 
The modified optimality conditions can be compactly written as rt{f,\,fi) = 0, where 

" -Vf/(/) +R^X-fi 
n{f,X,fx)= diag(A)s-(l/t)l 
_ diag(/i)/-(l/t)l 

The primal-dual search direction is the Newton step for solving the nonlinear equations 
rj(/, A, /i) = 0. If y = (/, A, fi) denotes the current point, the Newton step Ay = (A/, AA, A/i) 
is characterized by the linear equations 



niy + Ay) 
which, written out in more detail, are 



ftijj) +r'tiy)Ay = 



- diag(A)i? diag(s) 
diag(^) 





" A/ " 






AA 











(10.26) 



-/ 


diag(/) 

During the algorithm, the parameter t is increased, as the primal and dual variables approach 
optimality. When we have easy access to a dual feasible point during the algorithm, we can make 
use of the exact duality gap rj to set the value of t; in other cases, we can use the surrogate 
duality gap t). 

The primal-dual interior point algorithm is given in [120, §11.7], [121]. 

The most expensive part of computing the primal-dual search direction is solving equation 
(10.26). For problems of modest size, i.e. , with m and n no more than 10^, it can be solved 
using direct methods such as a sparse Cholesky decomposition. 

For larger problem instances [115] proposes to solve (10.26) approximately, using a precon- 
ditioned conjugate gradient (PCG) algorithm [122, §6.6], [123, chap. 2], [124, chap. 5]. When 
an iterative method is used to approximately solve a Newton system, the algorithm is referred to 
as an inexact, iterative, or approximate Newton method (see [123, chap. 6] and its references). 
When an iterative method is used inside a primal-dual interior-point method, the overall algorithm 
is called a truncated-Newton primal-dual interior-point method. For details of the PCG algorithm, 
we refer the reader to the references cited above. Each iteration requires multiplication of the 
matrix by a vector, and a few vector inner products. 
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10.4.3 Experimental Results 

In our first example we look at the performance of our method on a small network. The utility 
functions are all logarithmic, i.e. , Uj{fj) = logfj. There are n = 10^ flows, and m = 2 ■ 10^ 
links. The elements of R are chosen randomly and independently, so that the average route 
length is 10 links. The link capacities q are chosen independently from a uniform distribution on 
[0.1, 1]. For this particular example, there are about lO'^ nonzero elements in R (0.5% density). 

We compare three different algorithms for solving the NUM problem: The dual-decomposition 
method, a truncated Newton method via PCG and a customized Newton method via the GaBP 
solver. Out of the examined algorithms, the Newton method is centralized, while the dual- 
decomposition and GaBP solver are distributed algorithms. The source code of our Matlab 
simulation is available on [77]. 




Customized Newton via GaBP 
Trancated Newton via PCG 
Dual decomposition 



50 100 150 200 250 300 

Iteration number 

Figure 10.2: Convergence rate using the small settings. 

Figure 10.2 depicts the solution quality, where the X-axis represents the number of algorithm 
iterations, and the Y-axis is the surrogate duality gap (using a logarithmic scale). As clearly shown, 
the GaBP algorithm has a comparable performance to the sparse Cholesky decomposition, while 
it is a distributed algorithm. The dual decomposition method has much slower convergence. 

Our second example is too large to be solved using the primal-dual interior-point method with 
direct search direction computation, but is readily handled by the truncated-Newton primal-dual 
algorithm using PCG, the dual decomposition method and the customized Newton method via 
GaBP. The utility functions are all logarithmic: Uj{fj) = log/,. There are n = 10^ flows, and 
m = 2 ■ 10^ links. The elements of R and c are chosen as for the small example. For dual 
decomposition, we initialized all as 1. For the interior-point method, we initialized all Aj and 
Hi as 1. We initialize all fj as 7, where we choose 7 so that Rf < 0.9c. 

Our experimental results shows, that as the system size grows larger, the GaBP solver has 
favorable performance. Figure 10.3 plots the duality gap of both algorithms, vs. the number of 
iterations performed. 
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Figure 10.3: Convergence rate in the larger settings. 



Figure 10.4 shows that in terms of Newton steps, both methods had comparable performance. 
The Newton method via the GaBP algorithm converged in 11 steps, to an accuracy of 10^^ where 
the truncated Newton method implemented via PCG converged in 13 steps to the same accuracy. 
However, when examining the iteration count in each Newton step (the Y-axis) we see that the 
GaBP remained constant, while the PCG iterations significantly increase as we are getting closer 
to the optimal point. 



120r 
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Newton step number 
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Figure 10.4: Iteration count per Newton step. 



We have experimented with larger settings, up to = 10^ flows, and m = 2 ■ 10^ links. The 
GaBP algorithm converged in 11 Newton steps with 7-9 inner iteration in each Newton step. The 
PCG method converged in 16 Newton steps with an average of 45 inner iterations. 

Overall, we have observed three types of numerical problems with the PCG method. First, 
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the PCG Matlab implementation runs into numerical problems and failed to compute the search 
direction. Second, the line search failed, which means that no progress is possible in the computed 
direction without violating the problem constraints. Third, when getting close to the optimal 
solution, the number of PCG iterations significantly increases. 

The numerical problems of the PCG algorithm are well known, see of example [125,126]. In 
contrary, the GaBP algorithm did not suffer from the above numerical problems. 

Furthermore, the PCG is harder to distribute, since in each PCG iteration a vector dot product 
and a matrix product are performed. Those operations are global, unlike the GaBP which exploits 
the sparseness of the input matrix. 

We believe that the NUM problem serves as a case study for demonstrating the superior 
performance of the GaBP algorithm in solving sparse systems of linear equations. Since the 
problem of solving a system of linear equations is a fundamental problem in computer science 
and engineering, we envision many other applications for our proposed method. 
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In this chapter we discuss the relation between the GaBP algorithm and several other algorithms, 
and show that all of them are instances of the GaBP algorithm. Thus, a single efficient imple- 
mentation of GaBP can efficiently compute different cost functions without the need of deriving 
new update rules. Furthermore, it is easier to find sufficient conditions for convergence in our 
settings. 



11.1 Montanari's Linear Detection Algorithm 

in this section we show that the Montanari's algorithm [50], which is an iterative algorithm for 
linear detection, is an instance of Gaussian BP. A reader that is not familiar with Montanari's 
algorithm or linear detection is referred to Chapter 7. Our improved algorithm for linear detection 
described in Section 7.1 is more general. First, we allow different noise level for each received 
bit, unlike his work that uses a single fixed noise for the whole system. In practice, the bits 
are transmitted using different frequencies, thus suffering from different noise levels. Second, 
the update rules in his paper are fitted only to the randomly-spreading CDMA codes, where 
the matrix A contains only values which are drawn uniformly from {—1, 1}. Assuming binary 
signalling, he conjectures convergence to the large system limit. Our new convergence proof 
holds for any CDMA matrices provided that the absolute sum of the chip sequences is one, under 
weaker conditions on the noise level. Third, we propose in [7] an efficient broadcast version for 
saving messages in a broadcast supporting network. 

The probability distribution of the factor graph used by Montanari is: 

1^1 ^ 1 

^t^y'^ = -^NjiYl^^vi-2^'^^^^+ jya^a)Yl^^Pi-2^'^i) ■Ylexp{--^SaiUJaXi)du 

y a=l i=l i,a 

Extracting the self and edge potentials from the above probability distribution: 

ipii{xi) = exp(-^x^) oc A/'(x; 0, 1) 
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1paa{i^a) = exp {- -a^ + j J aUJ a) OC J\f{uJa; jja, f^'^) 
ifjiai^^i, LOa) = exp( y=SaiUJa^i) OC A/'(x; -4=Sai, 0) 

For convenience, Table 11.1 provides a translation between the notations used in [7] and that 
used by Montanari et al. in [50]: 



Table 11.1: Summary of notations 



GaBP [7] 


Montanari el al. [50] 


Description 


P 


.(t+i) 


precision msg from left to right 




Ut+l) 

/ 4— >a 


precision msg from right to left 




mean msg from left to right 




^ (t+1) 

^ . 

' a— *2 


mean msg from right to left 


l^ii 


Vi 


prior mean of left node 







prior mean of right node 


P 


1 


prior precision of left node 






prior precision of right node 




Gi 
U 


posterior mean of node 


P^ 


u 


posterior precision of node 


Aj 




covariance 




Vn 


covariance 




J 





Theorem 30. Montanari's update rules are special case of the GaBP algorithm. 

Proof. Now we derive Montanari's update rules. We start with the precision message from left 
to right: 

p ■ 




^ '^j^T '"i"^ ''-jS' 

b — >i 

By looking at Table 11.1, it is easy to verify that this precision update rule is equivalent to 2.17. 
Using the same logic we get the precision message from right to left: 

_a2 p-1 

p. »^ A« 
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The mean message from left to right is given by 



-A P~^' 

H'i 



The same calculation is done for the mean from right to left: 

li^a ~ iJa ]Sf '^^'^ > (t) ^k^a- 
^k-^a 

Finally, the left nodes calculated the precision and mean by 



^{t+l) _ _}_^ Sib ^ {t) J _ ^-i 



r{«+l) _ 1 , V ,, _ r (^-1 

Li - 1 + jY^bT^ ■ l^i - Li^i . 



□ 



The key difference between the two constructions is that Montanari uses a directed factor 
graph while we use an undirected graphical model. As a consequence, our construction provides 
additional convergence results and simpler update rules. 



11.2 Frey's Local Probability Propagation Algorithm 

Frey's local probability propagation [127] was published in 1998, before the work of Weiss on 
Gaussian Belief propagation. That is why it is interesting to find the relations between those two 
works. Frey's work deals with the factor analysis learning problem. The factor analyzer network 
is a two layer densely connected network that models bottom layer sensory inputs as a linear 
combination of top layer factors plus independent gaussian sensor noise. 

In the factor analysis model, the underlying distribution is Gaussian. There are n sensors that 
sense information generated from k factors. The prior distribution of the sensors is 

p{z) = J\f{z;0,I) , p{x\z) = J\f{z;Sx,'^) . 

The matrix S defines the linear relation between the factors and the sensors. Given S, the 
observation y and the noise level ^, the goal is to infer the most probable factor states x. It is 
shown that maximum likelihood estimation of S and \l/ performs factor analysis. 

The marginal distribution over x is: 



p{x) oc 



f 0, 1)N'{z; Sx, ^)rfx = ^^{z; 0, S^S + ^). 

J X 
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Table 11.2: Notations of GaBP (Weiss) and Frey's algorithm 



[7] 


Frey 


comments 




— 

Vnk 


precision message from left i to right a 




(i) 


precision message from right a to right i 




(i) 


mean message from left i to right a 




(i) 
Vkn 


mean message from right a to left i 




•^n 


prior mean of left node i 







prior mean of right node a 


P 

^ 11 


1 


prior precision of left node i 




Ai 


prior precision of right node i 






posterior mean of node i 


P^ 


4 


posterior precision of node i 




Anfc 


covariance of left node i and right node a 




1 


covariance of right node a and left node i 



This distribution is Gaussian as well, with the following parameters: 

E(z|x) = (S^S + *)-^S^y, 
Cov{z\x) = (S^S + *)-\ 

Theorem 31. Frey's iterative propagability propagation is an instance of GaBP. 

Proof. We start by showing that Frey's update rules are equivalent to the GaBP update rules. 
From right to left: 



'3 

i2 . ,(«-!) 



Vnk \2 kn \2 ' 

^nfc ^nk 

'"'^^^ V \ ^^^^ V 

(i) _ Xn - l^k^nkVkn , _ ^" ~ ^j¥=k11kn 

l^nk ~ \ ^ 'Ikn — \ 

^nk ^^^nk^ 

And from left to right: 

^= 1/(1/(1/(1 + s.iM2-i/^S))) = ' 1 7C 1 +s,^a/v^i^^), 
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^kn - VknK^nl^nklVnk " l^nklWnk) " VknK^j^kl^nklWnk) " 

__ 1 



P.-l 

IJ p 



1 ^£(0 

It is interesting to note, that in Frey's model the graph is directed. That is why the edges Aji = 1 
in the update rules from right to left. □ 



11.3 Moallami and Van-Roy's Consensus Propagation 

The consensus propagation (CP) [54] is a distributed algorithm for calculating the average value 
in the network. Given an adjacency graph matrix Q and a vector input values y, the goal is to 
compute the average value y. 

It is clear that the CP algorithm is related to the BP algorithm. However the relation to GaBP 
was not given. In this section we derive the update rules used in CP from the GaBP algorithm 
(Weiss). 

The self potentials of the nodes are 

iJii{xi) oc exp(-(xi - Hif) 

The edge potentials are: 

il)ij{xi,Xj) oc exp{-(3Qij{xi -Xjf) 
In total, the probability distribution p{x) 

p{x) oc Iliexp{-{xi - yi)yile(zE exp{-f]Qij{xi - Xjf) 

= exp(Si(-(xi - y,i)f - pEeesQijixi - Xjf) 

We want to find an assignment x* = maXxPix). It is shown in [54] that this assignment conforms 
to the mean value in the network: y = ^^iyi when P is very large. 

For simplicity of notations, we list the different notations in Table 11.3. 

Theorem 32. The consensus propagation algorithm is an instance of the Gaussian belief prop- 
agation algorithm. 

Proof We prove this theorem by substituting the self and edge potentials used in the CP paper 
in the belief propagation update rules, deriving the update rules of the CP algorithm. 

r . '- '- " ^ 

mij{xj) oc / exp(-(xi - ^uf) exp{-PQik{xi - Xjf) exp(-SfcPfcj(xj - fikiY)dxi = 



exY>{-x'l+2xiiiii-^'li-f3Qikx\+2f3QijXiXj-f3Qijx]-T.kPkix]+T.kPkili'ki-^k 
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Table 11.3: Notations of GaBP (Weiss) and Consensus Propagation 



Weiss 


CP 


comments 


Pn 


}C 


nrerision of ?/; •fT-l IT ^> tdi (tA 




KM 


mean of tpuixi) mki{xi) 


p 


J-ij{r^) 


precision message from i to j 




Qij{}C) 


mean message from i to j 




Vi 


prior mean of node i 


P 


1 


prior precision of node i 




y 


posterior mean of node i (identical for all nodes) 


Pt 


p 

^ 11 


posterior precision of node i 


b 


Qji 


covariance of nodes i and j 


b' 


Qij 


covariance of nodes i and j 


a 





variance of node i in the pairwise covariance matrix Vij 


c 





variance of node j in the pairwise covariance matrix Vij 



ax^ bx 
f ' ^ , s 

exp{-fil- PQijx'^j + j:kPijm'^j) / exp((l + PQij + T.kPki)xl + 2{[3QijXj + Pkil^ki + lJiii)xi dxi 

(11.1) 

Now we use the following integration rule: 

/ exp(— (ax^ + bx))dx = J 7r/aexp( — ) 
Jx 4a 

We compute the integral: 

ax^ bx 



„2 ^ 



exp(l + (3Qij + T.kPki)Xi + 2{(3QijXj + PkijJ'ki + IJ'ii)xi dxi 



62 

. 

32/12 ^2 



iiPQijXj + Pki^lk^ + IJ^uf ^ Qij^j + WQtj{PMl^ki + l^ti)Xj + (/ifc, + 



A/7r/aexp( — — ^ j oc exp 



1 + f3Qki + ^ Pki) 1 + ^Qi^^ + ^^^^^ 

k 



4a 

Substituting the result back into (11.1) we get: 



mij{xj) oc exp(-/ijj-/3Q.yX +SfcPj /ij ) exp 



P^Q'tj^'j + 2(3Qij{Pki^iki + l^ii)xj + (/ifci + /ii. 



2 ,,2 \ i 



(11.2) 
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For computing the precision, we take all the terms of x| from (11.2) and we get: 




-(3Q,,{l+f3Q,, + J:Pki)+/3^Ql 



1 + (3Qij + SPfe, 
1 + ^Pki 



(3Qi, + 1 + SPfc, 



The same is done for computing the mean, taking all the terms of Xj from equation (112): 



11.4 Quadratic Min-Sum Message Passing Algorithm 



The quadratic Min-Sum message passing algorithm was initially presented in [6]. It is a variant 
of the max-product algorithm, with underlying Gaussian distributions. The quadratic Min-Sum 
algorithm is an iterative algorithm for solving a quadratic cost function. Not surprisingly, as we 
have shown in Section 2.4 that the Max-Product and the Sum-Product algorithms are identical 
when the underlying distributions are Gaussians. In this chapter, we show that the quadratic 
Min-Sum algorithm is identical to the GaBP algorithm, although it was derived differently. 

In [6] the authors discuss the application for solving linear system of equations using the 
Min-Sum algorithm. Our work [7] was done in parallel to their work, and both papers appeared 
in the 45th Allerton 2007 conference. 

Theorem 33. The Quadratic Min-Sum algorithm is an instance of the GaBP algorithm. 
Proof. We start in the quadratic parameter updates: 




□ 



1 - ^u(iN{i)\j^'iilui 




Which is equivalent to 2.17. Regarding the mean parameters. 



1 ~ '^u&N(i)\j^ 

Which is equivalent to 2.18. 




□ 
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For simplicity of notations, we list the different notations in Table 11.4. As shown in Ta- 



Table 11.4: Notations of Min-Sum [6] vs. GaBP 



Min-Sum [6] 


GaBP [7] 


comments 


(f+i) 


A3 


quadratic parameters / product rule precision from i to j 


hi 


h 


linear parameters / product rule mean rom i to j 
prior mean of node i 


Xi 


1 

Xi 


prior precision of node i 
posterior mean of node i 


r 


Pi 

Aij 


posterior precision of node i 
covariance of nodes i and j 



ble 11.4, the Min-Sum algorithm assumes the covariance matrix F is first normalized s.t. the 
main diagonal entries (the variances) are all one. The messages sent in the Min-Sum algorithm 
are called linear parameters (which are equivalent to the mean messages in GaBP) and quadratic 
parameters (which are equivalent to variances). The difference between the algorithm is that in 
the GaBP algorithm, a node computes the product rule and the integral, and sends the result to 
its neighbor. In the Min-Sum algorithm, a node computes the product rule, sends the intermedi- 
ate result, and the receiving node computes the integral. In other words, the same computation 
is performed but on different locations. In the Min-Sum algorithm terminology, the messages are 
linear and quadratic parameters vs. Gaussians in our terminology. 
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12.1 Equivalence of Weiss and Johnson Formulations 

One of the confusing aspects of learning GaBP is that each paper is using its own model as well 
as its own notations. For completeness, we show equivalence of notations in Weiss' vs. Johnsons 
papers. In [128] it is shown that one of the possible ways of converting the information form to 
pairwise potentials form is when the inverse covariance matrices used are of the type: 



J^j 
Jji 



where the terms Jij = Jji are the entries of the inverse covariance matrix J in the information 
form. First we list the equivalent notations in the two papers: 



Weiss 


Johnson 


comments 


Po 


Ji\:j 


precision of ^paixi) llx,eN{x,)\x, ^kiixi) 




hi\j 


mean of i^a^Xi) Ux,eNix,)\x, ^Uxi) 


P 




precision message from i to j 






mean message from i to j 




hi 


prior mean of node i 


P 

^ 11 


J a 


prior precision of node i 


Pr 


{p^ir' 


posterior precision of node i 




fj'i 


posterior mean of node i 


b 


Jji 


covariance of nodes i and j 


b' 


Jij 


covariance of nodes i and j 


a 





variance of node i in the pairwise covariance matrix Vij 


c 





variance of node j in the pairwise covariance matrix Vij 



Using this fact we can derive again the BP equations for the scalar case (above are the same 
equation using Weiss' notation). 
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fJ'ki 



Po 



^hk^i , Ji\j — Jii + AJk^i, 



'Ah 



k&N{i)\j 

b MO 

Jji • ^Ji-^j 




— Jji (0 + Ji\j) ^ Jij . 



Finally: 



hi — ha + Ahk^i , Ji — Ja + AJ^^i, 

feGAf(i) A;6Af(i) 



(12.1) 



(12.2) 



12.2 GaBP code in Matlab 

Latest code appears on the web on: [77]. 
12.2.1 The file gabp.m 

°/„ Implementation of the Gaussian BP algorithm, as given in: 
°/„ Linear Detection via Belief Propagation 

°/o By Danny Bickson, Danny Dolev, Ori Shental, Paul H. Siegel and Jack K. Wolf. 

"/o In the 45th Annual Allerton Conference on Communication, Control and Computing, 

y, Allerton House, Illinois, Sept. 07' 

% 

I 

7o Written by Danny Bickson. 
% updated: 24-Aug-2008 
7o 

°/o input : A - square matrix nxn 
°/o b - vector nxl 

°/o max_iter - maximal number of iterations 
°/o epsilon - convergence threshold 

°/o output: X - vector of size nxl, which is the solution to linear systems 

of equations A x = b 
7o Pf - vector of size nxl, which is an approximation to the main 

°/o diagonal of inv(A) 

function [x,Pf] = gabp(A, b, max_iter, epsilon) 

7o Stage 1 - initialize 
P = diag(diag(A)) ; 
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U = diag(b./diag(A)); 
n = length (A) ; 

% Stage 2 - iterate 
for 1=1 :niax_iter 

7o record last round messages for convergence detection 
old_U = U; 

for i=l:n 
for j=l:n 

% Compute P i\j - line 2 
if (i~=j && A(i,j) ~= 0) 

p_i_minus_j = sum(P(:,i)) - P(j,i); °/, 
assert (p_i_minus_j ~= 0); 
/oiterate - line 3 

P(i,j) = -A(i,j) * A(j,i) / p_i_minus_j ; 
°/o Compute U i\j - line 2 

h_i_minus_j = (sum(P( : , i) . *U( : , i) ) - P(j ,i)*U(j ,i)) / p_i_minus_j ; 
'/oiterate - line 3 

U(i,j) = - A(i,j) * h_i_minus_j / P(i,j); 

end 
end 

end 



7„ Stage 3 - convergence detection 

if (sum(sum((U - old_U)."2)) < epsilon) 

disp(['GABP converged in round ', num2str (1)] ) ; 

break ; 

end 

end °/o iterate 

°/o Stage 4 - infer 
Pf = zeros (1 ,n) ; 
X = zeros (1 ,n) ; 
for i = l:n 

Pf(i) = sum(P(:,i)); 

x(i) = sum(U(: ,i) .*P(: ,i)) ./Pf (i); 

end 



end 
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12.2.2 The file run_gabp.m 

°/„ example for running the gabp algorithm, for computing the solution to Ax = b 
°/o Written by Danny Bickson 

7o Initialize 

'/format long; 

n = 3; A = [1 0.3 0.1;0.3 1 0.1;0.1 0.1 1]; 
b = [1 1 1] ' ; 
X = inv(A)*b; 
max_iter = 20; 
epsilon = 0.000001; 

[xl, p] = gabp(A, b, max_iter, epsilon); 
dispCx computed by gabp is: 



xl 



disp( 'x 



computed by matrix inversion is 



); 



X 



disp( Miag(inv(A) ) computed by gabp is: (this is an 
approximation ! ) ') ; 



P 



disp( Miag(inv(A) ) computed by matrix inverse is: 



diag(inv(A) ) 
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